Error Handling in Agentic Systems: Patterns That Don't Break at Scale
Agent errors aren't like code errors. They're probabilistic, context-dependent, and cascading. Here's how to handle them.
Why Agent Errors Are Different
Traditional software errors:
- Deterministic (same input, same error)
- Traceable (clear stack trace)
- Binary (works or doesn't)
Agent errors:
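- Probabilistic (the same input can succeed on one run and fail on the next)
- Context-dependent (failure depends on the prompt, the retrieved data, and the conversation so far)
- Partial (a result can be mostly right, so graceful degradation is possible)
- Cascading (one failure can trigger many others downstream)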
You can't just try/catch your way out of this.
The Error Taxonomy
Type 1: Tool Errors
The tools your agent calls fail.
Agent calls API → API returns 500 → Agent stuck
Characteristics: External, detectable, recoverable
Handling: Retry, fallback, timeout
Type 2: Reasoning Errors
The agent makes a bad decision.
Agent misinterprets request → Calls wrong tool → Wrong result
Characteristics: Internal, hard to detect, context-dependent
Handling: Validation, guardrails, human review
Type 3: Context Errors
The agent lacks necessary information.
Agent queries KB → Missing data → Hallucinates answer
Characteristics: Gap in knowledge, may be silent
Handling: Confidence scoring, explicit "unknown" handling
Type 4: Cascading Errors
One error causes a chain reaction.
Agent A fails → Agent B uses bad output → Agent C amplifies → System chaos
Characteristics: Multi-agent, damage compounds at every hop
Handling: Circuit breakers, isolation, rollback
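One way to make the taxonomy actionable is to tag each failure with its type and look up the default handling. The sketch below is illustrative only: `ERROR_HANDLING` and `strategiesFor` are hypothetical names, the handling lists are copied from the four types above, and sending unknown types to human review is an assumption.

```javascript
// Hypothetical lookup table mirroring the four error types above.
// Nothing here is a real library API.
const ERROR_HANDLING = Object.freeze({
  tool: ["retry", "fallback", "timeout"],
  reasoning: ["validation", "guardrails", "human review"],
  context: ["confidence scoring", "explicit unknown"],
  cascading: ["circuit breaker", "isolation", "rollback"],
});

// Return the default handling strategies for an error type.
function strategiesFor(type) {
  // hasOwnProperty guards against inherited keys such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(ERROR_HANDLING, type)) {
    // Assumption: an unclassified failure gets the most conservative treatment.
    return ["human review"];
  }
  return ERROR_HANDLING[type];
}

console.log(strategiesFor("tool").join(", ")); // retry, fallback, timeout
```

Classifying first keeps the handling decision in one place instead of scattering it across catch blocks.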
Pattern 1: Graceful Degradation
Don't fail completely when you can partially succeed.

async function generateBriefing(customerId) {
  const sections = {
    overview: null,
    recentActivity: null,
    openIssues: null,
    opportunities: null,
  };
  const failed = []; // names of sections that could not be loaded

  // Each section is fetched independently, so one failure cannot sink the rest
  try {
    sections.overview = await getCustomerOverview(customerId);
  } catch (e) {
    sections.overview = "Overview unavailable";
    failed.push("overview");
    log.warn("Overview failed", e);
  }

  try {
    sections.recentActivity = await getRecentActivity(customerId);
  } catch (e) {
    sections.recentActivity = "Recent activity unavailable";
    failed.push("recentActivity");
    log.warn("Activity failed", e);
  }

  // ... continue for each section

  // Return a partial result instead of failing completely.
  // Failures are counted explicitly: searching the text for "unavailable"
  // would throw on a null section and misfire on real content containing that word.
  const total = Object.keys(sections).length;
  return {
    ...sections,
    completeness: (total - failed.length) / total,
  };
}

User sees:
"Briefing for Acme Corp (75% complete)
Overview: [content]
Recent Activity: [content]
Open Issues: Information unavailable
Opportunities: [content]"
Better than: "Error: Failed to generate briefing"
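The user-facing text above can be produced by a small renderer. This is a minimal sketch, not the original implementation: `renderBriefing` is a hypothetical helper, and it assumes sections arrive as a map from display title to content, with `null` marking a section that failed.

```javascript
// Hypothetical renderer for a partially complete briefing.
// `sections` maps a display title to its content, or to null/undefined
// when that section could not be generated.
function renderBriefing(customerName, sections) {
  const titles = Object.keys(sections);
  const available = titles.filter((title) => sections[title] != null);
  const percent =
    titles.length === 0 ? 0 : Math.round((available.length / titles.length) * 100);

  const lines = [`Briefing for ${customerName} (${percent}% complete)`];
  for (const title of titles) {
    // Missing sections are shown explicitly instead of being dropped.
    lines.push(`${title}: ${sections[title] ?? "Information unavailable"}`);
  }
  return lines.join("\n");
}

const text = renderBriefing("Acme Corp", {
  "Overview": "[content]",
  "Recent Activity": "[content]",
  "Open Issues": null,
  "Opportunities": "[content]",
});
console.log(text);
```

Showing the completeness figure up front tells the reader how much to trust the briefing before they read it.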
Pattern 2: Retry with Backoff
Network and API errors often resolve themselves.
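How long to wait between attempts matters as much as whether to retry. Below is a minimal sketch of a delay schedule, assuming a hypothetical `backoffDelay` helper. The cap (`maxMs`) and the optional jitter are additions for illustration: jitter randomizes the wait so that many clients do not retry in lockstep.

```javascript
// Hypothetical helper: milliseconds to wait after the given failed attempt.
// `attempt` is 1-based. `random` is injectable so the schedule can be tested.
function backoffDelay(
  attempt,
  { baseMs = 1000, maxMs = 30000, jitter = false, random = Math.random } = {}
) {
  // Doubles with each attempt (2s, 4s, 8s, ...) until it reaches the cap.
  const exponential = Math.min(maxMs, baseMs * 2 ** attempt);
  // With jitter, pick a uniformly random delay in [0, exponential).
  return jitter ? Math.floor(random() * exponential) : exponential;
}

console.log([1, 2, 3].map((n) => backoffDelay(n))); // [ 2000, 4000, 8000 ]
```

Without a cap, the delay keeps doubling and a long outage can leave a caller waiting for minutes.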
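
// Small promise-based sleep helper used between attempts
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Note: maxRetries counts total attempts, including the first call
async function reliableToolCall(tool, params, maxRetries = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await tool(params);
    } catch (error) {
      lastError = error;
      if (!isRetryable(error)) {
        throw error; // Don't retry permanent failures
      }
      if (attempt < maxRetries) {
        // Exponential backoff: 2s after the first failure, 4s after the second, ...
        const delay = Math.pow(2, attempt) * 1000;
        log.info(`Attempt ${attempt}/${maxRetries} failed, retrying in ${delay}ms`);
        await sleep(delay);
      }
    }
  }
  throw new Error(`Failed after ${maxRetries} attempts: ${lastError.message}`);
}

// Transient HTTP statuses and network errors are worth retrying;
// anything else (for example 400 or 404) is treated as permanent.
function isRetryable(error) {
  const retryableCodes = [408, 429, 500, 502, 503, 504];
  return (
    retryableCodes.includes(error.status) ||
    error.code === "ECONNRESET" ||
    error.code === "ETIMEDOUT"
  );
}

Pattern 3: Fallback Chains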
When the primary method fails, try alternatives.

async function getCustomerContext(customerId) {
  // Primary: Full context from knowledge base
  try {
    return await getFullContext(customerId);
  } catch (e) {
    log.warn("Full context failed, trying summary", e);
  }

  // Fallback 1: Cached summary
  try {
    return await getCachedSummary(customerId);
  } catch (e) {
    log.warn("Summary cache miss, trying basic", e);
  }

  // Fallback 2: Basic info only
  try {
    return await getBasicInfo(customerId);
  } catch (e) {
    log.warn("Basic info failed", e);
  }

  // Final fallback: Explicit unknown
  return {
    customerId,
    status: "unknown",
    context: "No customer information available",
    confidence: 0,
  };
}

Pattern 4: Circuit Breakers
Stop calling failing services before they cascade.
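A circuit breaker is a small state machine with three states: CLOSED (calls pass through), OPEN (calls are rejected immediately), and HALF_OPEN (a trial call is allowed through). The transitions can be sketched as a pure function. `nextState` and its event names are hypothetical and exist only to illustrate the logic.

```javascript
// Hypothetical pure transition function for a circuit breaker.
// Events: "success" or "failure" (the outcome of a call) and
// "timeout_elapsed" (the reset timer has expired).
// `failures` is the failure count including the one being reported.
function nextState(state, event, { failures = 0, failureThreshold = 5 } = {}) {
  switch (state) {
    case "CLOSED":
      // Trip only once enough failures have accumulated.
      return event === "failure" && failures >= failureThreshold ? "OPEN" : "CLOSED";
    case "OPEN":
      // Stay open until the reset timer allows a trial call.
      return event === "timeout_elapsed" ? "HALF_OPEN" : "OPEN";
    case "HALF_OPEN":
      // The trial call decides: recover fully or trip again.
      if (event === "success") return "CLOSED";
      if (event === "failure") return "OPEN";
      return "HALF_OPEN";
    default:
      throw new Error(`Unknown state: ${state}`);
  }
}

console.log(nextState("OPEN", "timeout_elapsed")); // HALF_OPEN
```

Keeping the transitions in a pure function makes them easy to unit test separately from timers and network calls.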

class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000; // ms to wait before a trial call
    this.failures = 0;
    this.state = "CLOSED"; // CLOSED, OPEN, HALF_OPEN
    this.lastFailure = null;
  }

  async call(fn) {
    if (this.state === "OPEN") {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        // Reset timeout elapsed: let one trial call through
        this.state = "HALF_OPEN";
      } else {
        // Fail fast without touching the failing service
        throw new Error("Circuit breaker is OPEN");
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = "CLOSED";
  }

  onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    // A failed trial call in HALF_OPEN reopens at once, because the count
    // is still at or above the threshold
    if (this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      log.error("Circuit breaker opened");
    }
  }
}

Pattern 5: Validation Guardrails
Catch reasoning errors before they cause damage.

async function validateAgentAction(action, context) {
  const violations = [];

  // Rule: Don't delete without confirmation
  if (action.type === "delete" && !context.userConfirmed) {
    violations.push("Delete requires user confirmation");
  }

  // Rule: Don't write to production without flag
  if (action.writes && !context.allowWrites) {
    violations.push("Write operations not permitted");
  }

  // Rule: Don't exceed rate limits
  if (context.actionsThisMinute > 10) {
    violations.push("Rate limit exceeded");
  }

  // Rule: Confidence check (0-1 scale).
  // Test the type rather than truthiness, so a confidence of 0 is still caught.
  if (typeof action.confidence === "number" && action.confidence < 0.7) {
    violations.push(`Low confidence action (${action.confidence})`);
  }

  if (violations.length > 0) {
    return {
      valid: false,
      violations,
      suggestion: "Consider human review",
    };
  }
  return { valid: true };
}

Pattern 6: Confidence Scoring
Make uncertainty explicit.
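A model-reported score arrives as generated text, so it can be missing, malformed, or out of range. Here is a minimal sketch of defensive parsing, assuming the model is asked to reply with JSON containing `answer`, `confidence` (0-100), and `sources`. `parseScoredAnswer` is a hypothetical helper, and treating an unparseable reply as confidence 0 is one design choice, not the only one.

```javascript
// Hypothetical helper: turn raw model text into a validated answer object.
function parseScoredAnswer(raw) {
  const unknown = { answer: null, confidence: 0, sources: [] };

  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return unknown; // Not valid JSON at all
  }
  if (parsed === null || typeof parsed !== "object" || typeof parsed.answer !== "string") {
    return unknown; // Valid JSON, but not the expected shape
  }

  const score = Number(parsed.confidence);
  return {
    answer: parsed.answer,
    // Missing or non-numeric scores count as 0; others are clamped to 0-100.
    confidence: Number.isFinite(score) ? Math.min(100, Math.max(0, score)) : 0,
    sources: Array.isArray(parsed.sources) ? parsed.sources : [],
  };
}

console.log(parseScoredAnswer('{"answer":"42","confidence":150}').confidence); // 100
```

Defaulting to zero confidence means a malformed reply is routed to the low-confidence path instead of being trusted silently.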

async function answerQuestion(question) {
  const context = await searchKnowledgeBase(question);

  // Assumes agent.generate returns the reply already parsed into an object
  const response = await agent.generate({
    question,
    context,
    instructions: `
      Answer the question using provided context.
      Include a confidence score 0-100 based on:
      - Relevance of retrieved context
      - Completeness of information
      - Recency of sources
      Format: { "answer": "...", "confidence": N, "sources": [...] }
    `,
  });

  // Confidence here is on a 0-100 scale, unlike the 0-1 scale
  // used in the guardrail and escalation patterns
  if (response.confidence < 50) {
    return {
      ...response, // keep the sources alongside the warning
      warning: "Low confidence: consider verifying this information",
    };
  }
  return response;
}

Pattern 7: Rollback Capability
Undo when things go wrong.

class Transaction {
  constructor() {
    this.operations = [];
    this.completed = [];
  }

  // Register a step together with the function that undoes it
  addOperation(execute, rollback) {
    this.operations.push({ execute, rollback });
  }

  async commit() {
    for (const op of this.operations) {
      try {
        const result = await op.execute();
        this.completed.push({ op, result });
      } catch (error) {
        await this.rollback();
        // Keep the original error attached for debugging
        throw new Error(`Transaction failed: ${error.message}`, { cause: error });
      }
    }
    return { success: true };
  }

  async rollback() {
    // Roll back in reverse order; copy first so the loop does not mutate this.completed
    for (const { op, result } of [...this.completed].reverse()) {
      try {
        await op.rollback(result);
      } catch (e) {
        // A failed rollback leaves that step in place; log it and keep undoing the rest
        log.error("Rollback failed", e);
      }
    }
    this.completed = [];
  }
}

Pattern 8: Human Escalation
Know when to stop and ask.

async function handleWithEscalation(task, context) {
  // Attempt automated handling
  try {
    const result = await agent.handle(task);

    // Accept only confident, unflagged results.
    // >= keeps this check consistent with the "low_confidence" reason below.
    if (result.confidence >= 0.8 && !result.flaggedForReview) {
      return result;
    }

    // Low confidence or flagged: queue for review
    return await queueForHumanReview(task, result, {
      reason: result.confidence < 0.8 ? "low_confidence" : "flagged",
      agentSuggestion: result,
    });
  } catch (error) {
    // Error: escalate immediately
    return await escalateToHuman(task, {
      error: error.message,
      context,
      attemptedAt: new Date(),
    });
  }
}

The Error Handling Checklist
For every agent capability:
- What tools can fail? → Retry + fallback
- What decisions can be wrong? → Validation + guardrails
- What context can be missing? → Confidence scoring
- What can cascade? → Circuit breakers
- What needs rollback? → Transaction patterns
- When should humans intervene? → Escalation paths
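The checklist can also be enforced mechanically, for example by requiring every capability to declare an answer to each question before it ships. This is a minimal sketch: the `CHECKLIST` keys, the `missingAnswers` helper, and the sample `sendInvoice` capability are all hypothetical.

```javascript
// Hypothetical keys, one per checklist question above.
const CHECKLIST = [
  "toolFailures",
  "wrongDecisions",
  "missingContext",
  "cascades",
  "rollback",
  "escalation",
];

// Return the checklist questions a capability has not answered yet.
function missingAnswers(capability) {
  return CHECKLIST.filter((key) => {
    const answer = capability.errorHandling?.[key];
    return typeof answer !== "string" || answer.trim() === "";
  });
}

const sendInvoice = {
  name: "sendInvoice",
  errorHandling: {
    toolFailures: "retry + fallback",
    wrongDecisions: "validation guardrails",
    missingContext: "confidence scoring",
    cascades: "circuit breaker",
    rollback: "transaction",
    // escalation intentionally left unanswered
  },
};

console.log(missingAnswers(sendInvoice)); // [ 'escalation' ]
```

A check like this can run in CI so that a capability with an unanswered question never reaches production.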
The Mindset Shift
In traditional code, errors are exceptional.
In agentic systems, errors are expected.
Design for graceful degradation from day one. Your agents will fail—the question is whether they fail safely.
Build Resilient Agents
Xtended provides consistent APIs that fail predictably, making error handling patterns easier to implement. Build agents that degrade gracefully.