Debugging Agent Failures: A Systematic Approach
Agent failures aren't like code bugs. They're contextual, probabilistic, and often subtle. Here's how to diagnose them.
The Debugging Challenge
When code fails, you get a stack trace. When an agent fails, you get... a plausible-sounding wrong answer.
Traditional debugging:

Error → Stack trace → Line number → Fix

Agent debugging:

Wrong output → ??? → Many possible causes → Trial and error

You need a systematic approach. Here it is.
The Agent Failure Taxonomy
Category 1: Tool Failures
The agent called the right tool, but the tool failed.
Symptoms:
- Error messages in output
- Incomplete responses
- "I couldn't access..."
Debugging: Check tool logs, API responses, permissions.
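One way to make those logs exist in the first place: wrap every tool invocation so its input, output or error, and duration are recorded. A minimal sketch — `callToolWithLog` and the `tool.run` shape are assumptions, not a specific framework's API:

```javascript
// Hypothetical wrapper: record every tool invocation so failures leave
// evidence (name, input, output or error, duration), then rethrow.
async function callToolWithLog(tool, input, log) {
  const start = Date.now();
  try {
    const output = await tool.run(input);
    log.push({ tool: tool.name, input, output, durationMs: Date.now() - start });
    return output;
  } catch (err) {
    log.push({ tool: tool.name, input, error: String(err), durationMs: Date.now() - start });
    throw err; // let the agent's error handling see the failure too
  }
}
```

With this in place, "check tool logs" becomes a single array scan rather than archaeology.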
Category 2: Tool Selection Failures
The agent called the wrong tool.
Symptoms:
- Response addresses different question
- Missing expected data
- "I found information about X" (when you asked about Y)
Debugging: Review tool descriptions, test tool selection in isolation.
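"Test tool selection in isolation" can be as simple as running a list of queries through whatever selection step you have and comparing picks against expectations. A sketch, assuming you can expose your selection logic as a `selectTool(query, tools)` function:

```javascript
// Minimal harness: check tool selection against expected picks,
// isolated from the rest of the agent pipeline.
function testToolSelection(selectTool, tools, cases) {
  const failures = [];
  for (const { query, expected } of cases) {
    const picked = selectTool(query, tools);
    if (picked !== expected) failures.push({ query, expected, picked });
  }
  return { total: cases.length, failed: failures.length, failures };
}
```

The `failures` list tells you exactly which phrasings confuse the selector — usually the queries whose wording doesn't match any tool description.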
Category 3: Context Failures
The agent lacked necessary information.
Symptoms:
- Hallucinated facts
- Generic instead of specific answers
- "Based on general knowledge..."
Debugging: Check what context was available, verify retrieval.
Category 4: Reasoning Failures
The agent had the right information but drew wrong conclusions.
Symptoms:
- Logically inconsistent response
- Missed obvious connections
- Contradicts provided context
Debugging: Examine the reasoning chain, check for prompt issues.
Category 5: Instruction Failures
The agent didn't follow instructions correctly.
Symptoms:
- Wrong format
- Missing required elements
- Ignored constraints
Debugging: Review system prompt, test instruction compliance.
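Instruction compliance can often be tested mechanically, separate from answer quality. A sketch for the hypothetical case where the system prompt demands JSON with specific keys:

```javascript
// Validate output against the format the instructions demand
// (here: parseable JSON containing required keys).
function checkCompliance(output, requiredKeys) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return { compliant: false, reason: "not valid JSON" };
  }
  const missing = requiredKeys.filter((k) => !(k in parsed));
  return missing.length
    ? { compliant: false, reason: `missing keys: ${missing.join(", ")}` }
    : { compliant: true };
}
```

Running this over a batch of outputs separates "wrong format" failures from "wrong answer" failures before you start debugging.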
The Debugging Protocol
Step 1: Reproduce the Failure
Document exactly what happened:
```javascript
const failureRecord = {
  timestamp: "2024-01-15T10:30:00Z",
  input: "What's our biggest customer's renewal date?",
  expectedOutput: "Acme Corp renews on March 15, 2024",
  actualOutput: "Your biggest customer is Beta Inc with $50k ARR",
  context: { /* snapshot of available context */ },
  model: "claude-sonnet-4-20250514",
  systemPrompt: "...",
};
```

Can you reproduce it? Agent failures may be probabilistic—run the same input multiple times.
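Because the failure may only show up some of the time, measure a rate rather than a single run. A sketch — `runAgent` and `isCorrect` stand in for your own agent call and correctness check:

```javascript
// Run the same input several times and report the failure rate.
async function measureFailureRate(runAgent, isCorrect, input, runs = 10) {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    const output = await runAgent(input);
    if (!isCorrect(output)) failures++;
  }
  return failures / runs;
}
```

A 2/10 failure rate and a 10/10 failure rate point to very different causes: the former suggests borderline context or reasoning, the latter something structural.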
Step 2: Isolate the Category
Ask diagnostic questions:
□ Did the agent call any tools?
  → No tools called: Reasoning or instruction failure
  → Wrong tools called: Tool selection failure
  → Right tools, wrong results: Tool or context failure

□ Did the tool return correct data?
  → No: Tool failure
  → Yes but agent misused it: Reasoning failure

□ Was the correct context available?
  → No: Context failure
  → Yes but agent ignored it: Reasoning failure

□ Did the output follow instructions?
  → No: Instruction failure
  → Yes but wrong answer: Context or reasoning failure

Step 3: Examine the Evidence
For Tool Failures:

```javascript
// Log tool calls and responses
const toolLog = {
  toolName: "search_customers",
  input: { query: "biggest customer" },
  output: { error: "Connection timeout" },
  duration: 30000,
};
```

For Context Failures:

```javascript
// Compare available vs needed context
const contextAudit = {
  needed: ["customer list with ARR", "renewal dates"],
  available: ["customer list with ARR"], // Missing renewal dates!
  retrieved: ["customer list with ARR"],
};
```

For Reasoning Failures:

```javascript
// Request chain-of-thought
const debugPrompt = `
${originalPrompt}
Think step by step and explain your reasoning before answering.
`;
```

Step 4: Test Your Hypothesis
Once you think you know the cause, verify:

```javascript
// Hypothesis: Agent didn't have renewal date context
// Test: Add renewal dates and rerun
const testContext = {
  ...originalContext,
  renewalDates: { "Acme Corp": "2024-03-15" },
};
const result = await runAgent(originalInput, testContext);
// If correct now, hypothesis confirmed
```

Step 5: Apply the Fix
| Category | Common Fixes |
|---|---|
| Tool failure | Retry logic, timeout increase, error handling |
| Tool selection | Improve tool descriptions, add examples |
| Context failure | Add missing data, improve retrieval |
| Reasoning failure | Refine prompt, add constraints, chain-of-thought |
| Instruction failure | Clarify instructions, add format examples |
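For the first row of the table, retry with exponential backoff is the usual shape. A minimal sketch — delays and attempt counts are illustrative defaults, not recommendations:

```javascript
// Retry a failing async call with exponential backoff:
// waits baseDelayMs, then 2x, then 4x, ... between attempts.
async function withRetry(fn, { attempts = 3, baseDelayMs = 100 } = {}) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr; // all attempts exhausted
}
```

Wrap flaky tool calls in this, and a transient timeout stops masquerading as an agent failure.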
Debugging Tools
The Verbose Logger
```javascript
class AgentDebugger {
  constructor() {
    this.logs = [];
  }

  log(event, data) {
    this.logs.push({
      timestamp: Date.now(),
      event,
      data,
    });
  }

  async runWithLogging(agent, input) {
    this.log("input", input);

    const context = await agent.getContext(input);
    this.log("context_retrieved", context);

    const toolCalls = [];
    agent.onToolCall = (name, params, result) => {
      toolCalls.push({ name, params, result });
      this.log("tool_call", { name, params, result });
    };

    const output = await agent.run(input);
    this.log("output", output);

    return {
      output,
      logs: this.logs,
      toolCalls,
    };
  }

  analyze() {
    return {
      totalToolCalls: this.logs.filter((l) => l.event === "tool_call").length,
      contextSize: JSON.stringify(
        this.logs.find((l) => l.event === "context_retrieved")?.data
      ).length,
      timeline: this.logs.map((l) => ({
        event: l.event,
        time: l.timestamp - this.logs[0].timestamp,
      })),
    };
  }
}
```

The Comparison Test
```javascript
async function compareResponses(input, variations) {
  const results = [];

  for (const variation of variations) {
    const result = await runAgent(input, variation.context, variation.prompt);
    results.push({
      variation: variation.name,
      output: result,
      correct: evaluateCorrectness(result),
    });
  }

  return results;
}

// Usage
const comparison = await compareResponses(
  "What's our biggest customer's renewal date?",
  [
    { name: "baseline", context: originalContext, prompt: originalPrompt },
    { name: "with_renewals", context: contextWithRenewals, prompt: originalPrompt },
    { name: "explicit_instructions", context: originalContext, prompt: explicitPrompt },
  ]
);
```

Common Failure Patterns
Pattern: The Confident Hallucination
Symptom: Agent states incorrect facts with certainty.
Cause: Usually context failure—agent lacks information but doesn't admit it.
Fix: Add explicit "if you don't know, say so" instruction. Require source citation.
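One way to apply the fix mechanically: append the honesty instruction to every prompt rather than trusting each author to remember it. A sketch; the wording is illustrative:

```javascript
// Append an explicit uncertainty-and-citation instruction so missing
// context surfaces as "I don't know" instead of a confident guess.
function withHonestyGuard(prompt) {
  return `${prompt}

If the needed information is not in the provided context, say "I don't know"
rather than guessing. Cite the source for every factual claim.`;
}
```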
Pattern: The Lazy Tool Skip
Symptom: Agent answers from general knowledge instead of querying tools.
Cause: Tool description doesn't clearly indicate when to use it.
Fix: Add explicit triggers: "Always use search_customers when asked about specific customers."
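In practice that trigger lives in the tool definition itself. An illustrative shape — the object layout is a generic sketch, not any particular framework's tool schema:

```javascript
// Tool definition whose description states both what it does and
// when the agent must use it, so "answer from memory" stops competing.
const searchCustomersTool = {
  name: "search_customers",
  description:
    "Look up customer records (ARR, renewal dates, contacts). " +
    "Always use this tool when asked about a specific customer; " +
    "never answer customer questions from general knowledge.",
  parameters: {
    query: { type: "string", description: "Customer name or filter" },
  },
};
```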
Pattern: The Format Drift
Symptom: Output format slowly degrades over time.
Cause: Long conversations lose instruction context.
Fix: Reinforce format requirements in each query, not just system prompt.
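Mechanically, that means attaching the format requirement to each query instead of relying on the system prompt alone. A minimal sketch with a hypothetical table format:

```javascript
// Re-attach the format requirement to every user query so long
// conversations can't drift away from the system prompt's instruction.
const FORMAT_REMINDER =
  "Respond as a markdown table with columns: Customer, ARR, Renewal Date.";

function withFormatReminder(query) {
  return `${query}\n\n${FORMAT_REMINDER}`;
}
```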
Pattern: The Wrong Field
Symptom: Agent writes to or reads from wrong database field.
Cause: Ambiguous field names or missing descriptions.
Fix: Add detailed field descriptions: "mrr = Monthly Recurring Revenue in USD"
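An illustrative field dictionary, kept alongside the schema and given to the agent as context (the field set here is hypothetical):

```javascript
// Unambiguous names plus units and allowed values prevent
// wrong-field reads and writes.
const fieldDescriptions = {
  mrr: "Monthly Recurring Revenue in USD (numeric; excludes one-time fees)",
  arr: "Annual Recurring Revenue in USD (numeric; mrr * 12)",
  lifecycle_status: "One of: lead, trial, active, churned",
};
```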
Pattern: The Infinite Loop
Symptom: Agent keeps calling tools without making progress.
Cause: The tool returns unhelpful results and the agent doesn't know how to proceed.
Fix: Add loop detection, timeout, and explicit "if stuck, say so" instruction.
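Loop detection can be a small stateful check on the tool-call stream. A sketch — the threshold and key scheme are assumptions:

```javascript
// Flag when the same tool is called with identical arguments more than
// maxRepeats times: a strong signal the agent is stuck.
function makeLoopDetector(maxRepeats = 2) {
  const counts = new Map();
  return (name, params) => {
    const key = `${name}:${JSON.stringify(params)}`;
    const n = (counts.get(key) || 0) + 1;
    counts.set(key, n);
    return n > maxRepeats; // true → abort the run and report "stuck"
  };
}
```

Call the returned function on every tool call; when it returns true, stop the loop and surface the stall instead of burning tokens.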
Building a Debugging Habit
For Every Failure:
- Document it. Input, expected, actual, context.
- Categorize it. Which of the 5 categories?
- Isolate it. What single change fixes it?
- Fix it. Apply the minimal fix.
- Prevent it. Add a test for this case.
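The "prevent it" step scales when every documented failure becomes a regression case you can rerun. A sketch — `runAgent` and the per-case `check` are stand-ins for your own harness:

```javascript
// Replay documented failures as regression cases; each case pairs an
// input (and its context) with a correctness check on the output.
async function runRegressionSuite(runAgent, cases) {
  const results = [];
  for (const c of cases) {
    const output = await runAgent(c.input, c.context);
    results.push({ input: c.input, passed: c.check(output) });
  }
  return results;
}
```

Run the suite after every prompt, tool, or context change; a past failure that resurfaces shows up immediately.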
Weekly Review:
```javascript
const weeklyAnalysis = {
  totalFailures: 23,
  byCategory: {
    tool: 5,
    toolSelection: 3,
    context: 10, // ← Most common
    reasoning: 3,
    instruction: 2,
  },
  rootCauses: [
    "Missing renewal date context (7 failures)",
    "Ambiguous 'status' field (4 failures)",
  ],
  fixesApplied: [
    "Added renewal dates to customer template",
    "Renamed 'status' to 'lifecycle_status' with description",
  ],
};
```

The Debugging Mindset
Agent failures feel mysterious because the system is non-deterministic. But they have causes, and those causes are discoverable.
The key principles:
- Reproduce before fixing. If you can't reproduce, you can't verify a fix.
- Isolate variables. Change one thing at a time.
- Trust the evidence. Logs don't lie; intuition might.
- Fix forward. Every fix should prevent future similar failures.
Debugging agents is a skill. It gets easier with practice.
Debug with Confidence
Xtended's structured data and consistent APIs make debugging easier. When your knowledge base is organized, you can quickly identify what context was available versus what was needed.
Get Started Free