Error Handling in Agentic Systems: Patterns That Don't Break at Scale

Agent errors aren't like code errors. They're probabilistic, context-dependent, and cascading. Here's how to handle them.

Why Agent Errors Are Different

Traditional software errors:

  • Deterministic (same input, same error)
  • Traceable (clear stack trace)
  • Binary (works or doesn't)

Agent errors:

  • Probabilistic (the same input can succeed on one run and fail on the next)
  • Context-dependent (fails with certain inputs)
  • Partial (a task can half-succeed, so graceful degradation is possible)
  • Cascading (one failure can trigger many)

You can't just try/catch your way out of this.


The Error Taxonomy

Type 1: Tool Errors

The tools your agent calls fail.

Agent calls API → API returns 500 → Agent stuck

Characteristics: External, detectable, recoverable

Handling: Retry, fallback, timeout
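
Timeouts are the simplest of the three to add. Here is a minimal sketch, assuming the tool is an async function. `withTimeout` and `TimeoutError` are hypothetical names for illustration, not part of any library:

```javascript
// Hypothetical error type so callers can tell a timeout from other failures.
class TimeoutError extends Error {
  constructor(ms) {
    super(`Tool call timed out after ${ms}ms`);
    this.name = "TimeoutError";
    this.code = "ETIMEDOUT"; // lets a retry policy treat it as transient
  }
}

// Rejects if `fn` has not settled within `ms` milliseconds.
// Note: this stops waiting; it does not cancel the underlying work.
async function withTimeout(fn, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  try {
    return await Promise.race([fn(), timeout]);
  } finally {
    clearTimeout(timer); // don't leave a pending timer once fn settles
  }
}
```

The wrapper only stops waiting. If the tool supports cancellation, you would still need to wire that up separately.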

Type 2: Reasoning Errors

The agent makes a bad decision.

Agent misinterprets request → Calls wrong tool → Wrong result

Characteristics: Internal, hard to detect, context-dependent

Handling: Validation, guardrails, human review

Type 3: Context Errors

The agent lacks necessary information.

Agent queries KB → Missing data → Hallucinates answer

Characteristics: Gap in knowledge, may be silent

Handling: Confidence scoring, explicit "unknown" handling

Type 4: Cascading Errors

One error causes a chain reaction.

Agent A fails → Agent B uses bad output → Agent C amplifies → System chaos

Characteristics: Multi-agent, damage compounds at each hop

Handling: Circuit breakers, isolation, rollback
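
Isolation can be as simple as checking each agent's output before the next one sees it. The sketch below is one way to do that, assuming each stage supplies its own validity check. `runPipeline` and the `{ name, run, isValid }` stage shape are illustrative assumptions:

```javascript
// stages: ordered list of { name, run, isValid }.
// Stops at the first stage that throws or produces output its own check
// rejects, so later agents never receive the bad value.
async function runPipeline(stages, input) {
  let value = input;
  for (const { name, run, isValid } of stages) {
    let output;
    try {
      output = await run(value);
    } catch (e) {
      return { ok: false, failedAt: name, reason: e.message, lastGood: value };
    }
    if (!isValid(output)) {
      return { ok: false, failedAt: name, reason: "invalid output", lastGood: value };
    }
    value = output;
  }
  return { ok: true, value };
}
```

Returning `lastGood` gives the caller something to fall back on or hand to a human, instead of whatever Agent C would have built on top of Agent B's mistake.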


Pattern 1: Graceful Degradation

Don't fail completely when you can partially succeed.

async function generateBriefing(customerId) {
  const sections = {
    overview: null,
    recentActivity: null,
    openIssues: null,
    opportunities: null,
  };

  // Each section independent
  try {
    sections.overview = await getCustomerOverview(customerId);
  } catch (e) {
    sections.overview = "Overview unavailable";
    log.warn("Overview failed", e);
  }

  try {
    sections.recentActivity = await getRecentActivity(customerId);
  } catch (e) {
    sections.recentActivity = "Recent activity unavailable";
    log.warn("Activity failed", e);
  }

  // ... continue for each section

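  // Return partial result instead of failing completely.
  // Guard against null (section never filled) and non-string content,
  // either of which would make a bare s.includes(...) throw.
  const isAvailable = (s) =>
    s != null && !(typeof s === "string" && s.endsWith("unavailable"));
  const availableSections = Object.values(sections).filter(isAvailable);

  return {
    ...sections,
    completeness: availableSections.length / Object.keys(sections).length,
  };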
}

User sees:

"Briefing for Acme Corp (75% complete)

Overview: [content]
Recent Activity: [content]
Open Issues: Information unavailable
Opportunities: [content]"

Better than: "Error: Failed to generate briefing"
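
Because the sections are independent, they can also be fetched concurrently. The sketch below uses `Promise.allSettled` and assumes each fetcher is a function that takes the customer ID. `buildBriefing` and the `fetchers` map are illustrative names, not part of the code above:

```javascript
// fetchers: { sectionName: (customerId) => content or Promise<content> }
async function buildBriefing(customerId, fetchers) {
  const names = Object.keys(fetchers);
  // The async wrapper turns a synchronous throw into a rejection too.
  const results = await Promise.allSettled(
    names.map(async (name) => fetchers[name](customerId))
  );

  const sections = {};
  const failed = [];
  results.forEach((result, i) => {
    if (result.status === "fulfilled") {
      sections[names[i]] = result.value;
    } else {
      sections[names[i]] = null;
      failed.push(names[i]); // track failures by name, not by string matching
    }
  });

  const completeness =
    names.length === 0 ? 0 : (names.length - failed.length) / names.length;
  return { sections, failed, completeness };
}
```

Tracking failed sections by name keeps the completeness calculation independent of whatever text the section happens to contain.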


Pattern 2: Retry with Backoff

Network and API errors often resolve themselves.

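// Promise-based sleep used between retries
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function reliableToolCall(tool, params, maxRetries = 3) {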
  let lastError;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await tool(params);
    } catch (error) {
      lastError = error;

      if (!isRetryable(error)) {
        throw error; // Don't retry permanent failures
      }

      if (attempt < maxRetries) {
        // Exponential: 2s, then 4s with the default maxRetries of 3
        const delay = Math.pow(2, attempt) * 1000;
        log.info(`Attempt ${attempt}/${maxRetries} failed, retrying in ${delay}ms`);
        await sleep(delay);
      }
    }
  }

  throw new Error(`Failed after ${maxRetries} attempts: ${lastError.message}`);
}

function isRetryable(error) {
  const retryableCodes = [408, 429, 500, 502, 503, 504];
  return (
    retryableCodes.includes(error.status) ||
    error.code === "ECONNRESET" ||
    error.code === "ETIMEDOUT"
  );
}
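
One common refinement, not part of the code above, is to cap the delay and add random jitter so that many clients retrying at once don't all hit the service at the same moment. A minimal sketch, where `backoffDelay` and its option names are hypothetical:

```javascript
// Delay in ms before the retry that follows failed attempt `attempt` (1-based).
// Grows exponentially, is capped at maxMs, then a uniformly random fraction
// of that ceiling is returned ("full jitter").
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(
  attempt,
  { baseMs = 1000, maxMs = 30000, random = Math.random } = {}
) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

The trade-off is that individual delays become less predictable, since any single retry may fire almost immediately or wait the full ceiling.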

Pattern 3: Fallback Chains

When the primary method fails, try alternatives.

async function getCustomerContext(customerId) {
  // Primary: Full context from knowledge base
  try {
    return await getFullContext(customerId);
  } catch (e) {
    log.warn("Full context failed, trying summary", e);
  }

  // Fallback 1: Cached summary
  try {
    return await getCachedSummary(customerId);
  } catch (e) {
    log.warn("Summary cache miss, trying basic", e);
  }

  // Fallback 2: Basic info only
  try {
    return await getBasicInfo(customerId);
  } catch (e) {
    log.warn("Basic info failed", e);
  }

  // Final fallback: Explicit unknown
  return {
    customerId,
    status: "unknown",
    context: "No customer information available",
    confidence: 0,
  };
}
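
The same chain can be written generically. The sketch below also tags the result with the tier that produced it, so downstream code can tell full context from a stale summary. `firstAvailable` and the `{ name, load }` shape are assumptions made for illustration:

```javascript
// sources: ordered list of { name, load }, where load is an async function.
// Returns the first successful value, the name of the source that produced
// it, and the errors collected from the sources tried before it.
async function firstAvailable(sources, fallbackValue) {
  const errors = [];
  for (const { name, load } of sources) {
    try {
      const value = await load();
      return { source: name, value, errors };
    } catch (e) {
      errors.push({ source: name, message: e.message });
    }
  }
  // Every source failed: return the explicit fallback instead of throwing.
  return { source: "none", value: fallbackValue, errors };
}
```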

Pattern 4: Circuit Breakers

Stop calling failing services before they cascade.

class CircuitBreaker {
  constructor(options = {}) {
    // ?? instead of || so an explicit 0 is respected rather than replaced
    this.failureThreshold = options.failureThreshold ?? 5;
    this.resetTimeout = options.resetTimeout ?? 30000; // ms before a trial call
    this.failures = 0;
    this.state = "CLOSED"; // CLOSED, OPEN, HALF_OPEN
    this.lastFailure = null;
  }

  async call(fn) {
    if (this.state === "OPEN") {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = "HALF_OPEN";
      } else {
        throw new Error("Circuit breaker is OPEN");
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = "CLOSED";
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      log.error("Circuit breaker opened");
    }
  }
}

Pattern 5: Validation Guardrails

Catch reasoning errors before they cause damage.

async function validateAgentAction(action, context) {
  const violations = [];

  // Rule: Don't delete without confirmation
  if (action.type === "delete" && !context.userConfirmed) {
    violations.push("Delete requires user confirmation");
  }

  // Rule: Don't write to production without flag
  if (action.writes && !context.allowWrites) {
    violations.push("Write operations not permitted");
  }

  // Rule: Don't exceed rate limits
  if (context.actionsThisMinute > 10) {
    violations.push("Rate limit exceeded");
  }

  // Rule: Confidence check (a confidence of 0 is falsy, so test the type)
  if (typeof action.confidence === "number" && action.confidence < 0.7) {
    violations.push(`Low confidence action (${action.confidence})`);
  }

  if (violations.length > 0) {
    return {
      valid: false,
      violations,
      suggestion: "Consider human review",
    };
  }

  return { valid: true };
}
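
A validator only helps if nothing runs without passing through it. One way to wire that up is sketched below, with the validator and executor passed in as parameters. `executeWithGuardrails` is a hypothetical helper name:

```javascript
// Runs `execute(action)` only if `validate` approves; otherwise returns the
// violations without executing anything.
async function executeWithGuardrails(action, context, validate, execute) {
  const verdict = await validate(action, context);
  if (!verdict.valid) {
    return { executed: false, violations: verdict.violations };
  }
  return { executed: true, result: await execute(action) };
}
```

With the validator above, usage might look like `executeWithGuardrails(action, context, validateAgentAction, runAction)`, where `runAction` stands in for whatever actually performs the action.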

Pattern 6: Confidence Scoring

Make uncertainty explicit.

async function answerQuestion(question) {
  const context = await searchKnowledgeBase(question);

  const response = await agent.generate({
    question,
    context,
    instructions: `
      Answer the question using provided context.
      Include a confidence score 0-100 based on:
      - Relevance of retrieved context
      - Completeness of information
      - Recency of sources
      Format: { "answer": "...", "confidence": N, "sources": [...] }
    `,
  });

  // Note: this pattern scores 0-100; Patterns 5 and 8 use a 0-1 scale.
  if (response.confidence < 50) {
    return {
      ...response, // keep answer, confidence, and sources
      warning: "Low confidence: consider verifying this information",
    };
  }

  return response;
}
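
The code above assumes the model's reply has already been parsed into the requested shape. If you receive raw text instead, it is safer to validate it before trusting the score. The sketch below treats anything malformed as confidence 0, and `parseScoredAnswer` is a hypothetical helper:

```javascript
// Parses raw model output expected to look like
// { "answer": "...", "confidence": N, "sources": [...] } with N in 0-100.
// Anything malformed is treated as confidence 0 rather than trusted.
function parseScoredAnswer(raw) {
  const malformed = { answer: null, confidence: 0, sources: [], malformed: true };
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return malformed;
  }
  const ok =
    parsed !== null &&
    typeof parsed === "object" &&
    typeof parsed.answer === "string" &&
    Number.isFinite(parsed.confidence);
  if (!ok) return malformed;

  return {
    answer: parsed.answer,
    // Clamp out-of-range scores into 0-100.
    confidence: Math.min(100, Math.max(0, parsed.confidence)),
    sources: Array.isArray(parsed.sources) ? parsed.sources : [],
    malformed: false,
  };
}
```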

Pattern 7: Rollback Capability

Undo when things go wrong.

class Transaction {
  constructor() {
    this.operations = [];
    this.completed = [];
  }

  addOperation(execute, rollback) {
    this.operations.push({ execute, rollback });
  }

  async commit() {
    for (const op of this.operations) {
      try {
        const result = await op.execute();
        this.completed.push({ op, result });
      } catch (error) {
        await this.rollback();
        // Keep the original error attached for debugging
        throw new Error(`Transaction failed: ${error.message}`, { cause: error });
      }
    }
    return { success: true };
  }

  async rollback() {
    // Rollback in reverse order
    for (const { op, result } of this.completed.reverse()) {
      try {
        await op.rollback(result);
      } catch (e) {
        log.error("Rollback failed", e);
      }
    }
    this.completed = [];
  }
}

Pattern 8: Human Escalation

Know when to stop and ask.

async function handleWithEscalation(task, context) {
  // Attempt automated handling
  try {
    const result = await agent.handle(task);

    // Check if agent is confident (>= so the boundary matches the reason below)
    if (result.confidence >= 0.8 && !result.flaggedForReview) {
      return result;
    }

    // Low confidence or flagged: queue for review
    return await queueForHumanReview(task, result, {
      reason: result.confidence < 0.8 ? "low_confidence" : "flagged",
      agentSuggestion: result,
    });
  } catch (error) {
    // Error: escalate immediately
    return await escalateToHuman(task, {
      error: error.message,
      context,
      attemptedAt: new Date(),
    });
  }
}

The Error Handling Checklist

For every agent capability:

  • What tools can fail? → Retry + fallback
  • What decisions can be wrong? → Validation + guardrails
  • What context can be missing? → Confidence scoring
  • What can cascade? → Circuit breakers
  • What needs rollback? → Transaction patterns
  • When should humans intervene? → Escalation paths

The Mindset Shift

In traditional code, errors are exceptional.

In agentic systems, errors are expected.

Design for graceful degradation from day one. Your agents will fail—the question is whether they fail safely.

Build Resilient Agents

Xtended provides consistent APIs that fail predictably, making error handling patterns easier to implement. Build agents that degrade gracefully.