
Debugging Agent Failures: A Systematic Approach

Agent failures aren't like code bugs. They're contextual, probabilistic, and often subtle. Here's how to diagnose them.


The Debugging Challenge

When code fails, you get a stack trace. When an agent fails, you get... a plausible-sounding wrong answer.

Traditional debugging:

Error → Stack trace → Line number → Fix

Agent debugging:

Wrong output → ??? → Many possible causes → Trial and error

You need a systematic approach. Here it is.


The Agent Failure Taxonomy

Category 1: Tool Failures

The agent called the right tool, but the tool failed.

Symptoms:

  • Error messages in output
  • Incomplete responses
  • "I couldn't access..."

Debugging: Check tool logs, API responses, permissions.
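Tool logs only help if they exist. A minimal sketch of a logging wrapper around each tool call; `callToolWithLogging` and the `{ name, run(input) }` tool shape are my assumptions, not any framework's API:

```javascript
// Sketch: wrap each tool call so errors and timings leave evidence behind.
// `tool` is assumed to be { name, run(input) } — adapt to your framework's shape.
async function callToolWithLogging(tool, input, log) {
  const start = Date.now();
  try {
    const output = await tool.run(input);
    log.push({ tool: tool.name, input, output, ok: true, durationMs: Date.now() - start });
    return output;
  } catch (err) {
    log.push({ tool: tool.name, input, error: err.message, ok: false, durationMs: Date.now() - start });
    throw err; // let the agent's own error handling see it too
  }
}
```

With this in place, "I couldn't access..." responses come with a log entry naming the tool, the input, and the actual error.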

Category 2: Tool Selection Failures

The agent called the wrong tool.

Symptoms:

  • Response addresses different question
  • Missing expected data
  • "I found information about X" (when you asked about Y)

Debugging: Review tool descriptions, test tool selection in isolation.
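Testing tool selection in isolation can be as simple as a table of inputs and the tool you expect the agent to pick. A sketch, assuming your agent exposes a `selectTool` step that returns only the chosen tool's name (a hypothetical API):

```javascript
// Sketch: exercise only the tool-selection step against known inputs.
// `agent.selectTool(input)` is an assumed API returning the chosen tool's name.
async function testToolSelection(agent, cases) {
  const results = [];
  for (const c of cases) {
    const chosen = await agent.selectTool(c.input);
    results.push({ input: c.input, expected: c.expectTool, chosen, pass: chosen === c.expectTool });
  }
  return results;
}
```

Run this after any change to a tool description; selection regressions show up immediately instead of as mysterious wrong answers.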

Category 3: Context Failures

The agent lacked necessary information.

Symptoms:

  • Hallucinated facts
  • Generic instead of specific answers
  • "Based on general knowledge..."

Debugging: Check what context was available, verify retrieval.

Category 4: Reasoning Failures

The agent had the right information but drew wrong conclusions.

Symptoms:

  • Logically inconsistent response
  • Missed obvious connections
  • Contradicts provided context

Debugging: Examine the reasoning chain, check for prompt issues.

Category 5: Instruction Failures

The agent didn't follow instructions correctly.

Symptoms:

  • Wrong format
  • Missing required elements
  • Ignored constraints

Debugging: Review system prompt, test instruction compliance.


The Debugging Protocol

Step 1: Reproduce the Failure

Document exactly what happened:

const failureRecord = {
  timestamp: "2024-01-15T10:30:00Z",
  input: "What's our biggest customer's renewal date?",
  expectedOutput: "Acme Corp renews on March 15, 2024",
  actualOutput: "Your biggest customer is Beta Inc with $50k ARR",
  context: { /* snapshot of available context */ },
  model: "claude-sonnet-4-20250514",
  systemPrompt: "...",
};

Can you reproduce it? Agent failures may be probabilistic—run the same input multiple times.
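Since the failure may only show up some of the time, it helps to quantify it rather than eyeball it. A sketch that reruns the same input and reports a failure rate; `runAgent` and `isFailure` are placeholders for your own harness:

```javascript
// Sketch: estimate how often a probabilistic failure reproduces.
async function measureFailureRate(runAgent, input, isFailure, runs = 10) {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    const output = await runAgent(input);
    if (isFailure(output)) failures++;
  }
  return failures / runs; // e.g. 0.3 → fails 3 times in 10
}
```

A rate of 1.0 means a deterministic bug you can debug directly; anything lower means you'll need the same measurement again later to verify your fix.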

Step 2: Isolate the Category

Ask diagnostic questions:

□ Did the agent call any tools?
  → No tools called: Reasoning or instruction failure
  → Wrong tools called: Tool selection failure
  → Right tools, wrong results: Tool or context failure

□ Did the tool return correct data?
  → No: Tool failure
  → Yes but agent misused it: Reasoning failure

□ Was the correct context available?
  → No: Context failure
  → Yes but agent ignored it: Reasoning failure

□ Did the output follow instructions?
  → No: Instruction failure
  → Yes but wrong answer: Context or reasoning failure

Step 3: Examine the Evidence

For Tool Failures:

// Log tool calls and responses
const toolLog = {
  toolName: "search_customers",
  input: { query: "biggest customer" },
  output: { error: "Connection timeout" },
  duration: 30000,
};

For Context Failures:

// Compare available vs needed context
const contextAudit = {
  needed: ["customer list with ARR", "renewal dates"],
  available: ["customer list with ARR"], // Missing renewal dates!
  retrieved: ["customer list with ARR"],
};

For Reasoning Failures:

// Request chain-of-thought
const debugPrompt = `
${originalPrompt}

Think step by step and explain your reasoning before answering.
`;

Step 4: Test Your Hypothesis

Once you think you know the cause, verify:

// Hypothesis: Agent didn't have renewal date context
// Test: Add renewal dates and rerun

const testContext = {
  ...originalContext,
  renewalDates: { "Acme Corp": "2024-03-15" },
};

const result = await runAgent(originalInput, testContext);
// If correct now, hypothesis confirmed

Step 5: Apply the Fix

Category             | Common Fixes
---------------------|------------------------------------------------
Tool failure         | Retry logic, timeout increase, error handling
Tool selection       | Improve tool descriptions, add examples
Context failure      | Add missing data, improve retrieval
Reasoning failure    | Refine prompt, add constraints, chain-of-thought
Instruction failure  | Clarify instructions, add format examples
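For tool failures, retry logic is often enough to paper over transient errors. A minimal sketch with exponential backoff; the names are mine, not from any framework:

```javascript
// Sketch: retry a flaky async call with exponential backoff.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts — surface the real error
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```

Keep the retry count low: retries hide real outages, and an agent waiting on three slow retries feels broken to the user.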

Debugging Tools

The Verbose Logger

class AgentDebugger {
  constructor() {
    this.logs = [];
  }

  log(event, data) {
    this.logs.push({
      timestamp: Date.now(),
      event,
      data,
    });
  }

  async runWithLogging(agent, input) {
    this.log("input", input);

    const context = await agent.getContext(input);
    this.log("context_retrieved", context);

    const toolCalls = [];
    agent.onToolCall = (name, params, result) => {
      toolCalls.push({ name, params, result });
      this.log("tool_call", { name, params, result });
    };

    const output = await agent.run(input);
    this.log("output", output);

    return {
      output,
      logs: this.logs,
      toolCalls,
    };
  }

  analyze() {
    return {
      totalToolCalls: this.logs.filter((l) => l.event === "tool_call").length,
      contextSize: JSON.stringify(
        // `?? null` guards against no context_retrieved event being logged,
        // where JSON.stringify(undefined) would return undefined and .length would throw
        this.logs.find((l) => l.event === "context_retrieved")?.data ?? null
      ).length,
      timeline: this.logs.map((l) => ({
        event: l.event,
        time: l.timestamp - this.logs[0].timestamp,
      })),
    };
  }
}

The Comparison Test

async function compareResponses(input, variations) {
  const results = [];

  for (const variation of variations) {
    const result = await runAgent(input, variation.context, variation.prompt);
    results.push({
      variation: variation.name,
      output: result,
      correct: evaluateCorrectness(result),
    });
  }

  return results;
}

// Usage
const comparison = await compareResponses(
  "What's our biggest customer's renewal date?",
  [
    { name: "baseline", context: originalContext, prompt: originalPrompt },
    { name: "with_renewals", context: contextWithRenewals, prompt: originalPrompt },
    { name: "explicit_instructions", context: originalContext, prompt: explicitPrompt },
  ]
);

Common Failure Patterns

Pattern: The Confident Hallucination

Symptom: Agent states incorrect facts with certainty.

Cause: Usually context failure—agent lacks information but doesn't admit it.

Fix: Add explicit "if you don't know, say so" instruction. Require source citation.
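In practice this fix is a couple of lines appended to the system prompt. A sketch of the wording, with a hypothetical base prompt; adapt it to your own system, nothing here is a magic phrase:

```javascript
// Sketch: explicit uncertainty and citation rules appended to the system prompt.
const baseSystemPrompt = "You are a customer data assistant."; // hypothetical base prompt

const antiHallucinationRules = [
  "If the provided context does not contain the answer, say you don't have that information. Never guess.",
  "Cite the source document or tool result for every factual claim.",
].join("\n");

const systemPrompt = `${baseSystemPrompt}\n\n${antiHallucinationRules}`;
```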

Pattern: The Lazy Tool Skip

Symptom: Agent answers from general knowledge instead of querying tools.

Cause: Tool description doesn't clearly indicate when to use it.

Fix: Add explicit triggers: "Always use search_customers when asked about specific customers."
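Concretely, the trigger belongs in the tool's description field. A sketch using a JSON-Schema-style tool definition; the exact shape depends on your framework, so treat the field names as an assumption:

```javascript
// Sketch: a tool definition whose description says exactly when it must be used.
const searchCustomersTool = {
  name: "search_customers",
  description:
    "Search the customer database by name, revenue, or renewal date. " +
    "Always use this tool when asked about a specific customer. " +
    "Never answer customer questions from general knowledge.",
  input_schema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Search terms, e.g. a customer name" },
    },
    required: ["query"],
  },
};
```

The "never answer from general knowledge" line matters as much as the "always use" line: it closes the escape hatch the agent was taking.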

Pattern: The Format Drift

Symptom: Output format slowly degrades over time.

Cause: Long conversations lose instruction context.

Fix: Reinforce format requirements in each query, not just system prompt.
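A cheap way to do this is to append the format requirement to every user message instead of stating it once. A sketch; the reminder text is just an example:

```javascript
// Sketch: restate the format requirement on every turn so it can't fall out of context.
const FORMAT_REMINDER = 'Reply as a JSON object with keys "answer" and "sources".';

function withFormatReminder(userMessage) {
  return `${userMessage}\n\n${FORMAT_REMINDER}`;
}
```

A one-line reminder per turn costs a few tokens and removes a whole class of long-conversation drift.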

Pattern: The Wrong Field

Symptom: Agent writes to or reads from wrong database field.

Cause: Ambiguous field names or missing descriptions.

Fix: Add detailed field descriptions: "mrr = Monthly Recurring Revenue in USD"

Pattern: The Infinite Loop

Symptom: Agent keeps calling tools without making progress.

Cause: The tool returns unhelpful results and the agent doesn't know how to proceed.

Fix: Add loop detection, timeout, and explicit "if stuck, say so" instruction.
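Loop detection can be as simple as counting repeated identical tool calls. A sketch; the call shape is assumed:

```javascript
// Sketch: flag when the agent repeats the same tool call with the same params.
function makeLoopDetector(maxRepeats = 3) {
  const counts = new Map();
  return function allow(toolName, params) {
    const key = `${toolName}:${JSON.stringify(params)}`;
    const count = (counts.get(key) ?? 0) + 1;
    counts.set(key, count);
    return count <= maxRepeats; // false → stop the loop and tell the user you're stuck
  };
}
```

When `allow` returns false, abort the tool loop and have the agent report what it tried, rather than silently burning tokens.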


Building a Debugging Habit

For Every Failure:

  1. Document it. Input, expected, actual, context.
  2. Categorize it. Which of the 5 categories?
  3. Isolate it. What single change fixes it?
  4. Fix it. Apply the minimal fix.
  5. Prevent it. Add a test for this case.
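Step 5 can be a tiny harness that replays documented failures after every change. A sketch, reusing the renewal-date example from earlier; `runAgent` stands in for your own entry point:

```javascript
// Sketch: replay documented failure cases and report any that regress.
const regressionCases = [
  {
    input: "What's our biggest customer's renewal date?",
    mustInclude: ["Acme Corp", "March 15"],
  },
];

async function runRegressions(runAgent, cases) {
  const failed = [];
  for (const c of cases) {
    const output = await runAgent(c.input);
    if (!c.mustInclude.every((s) => output.includes(s))) {
      failed.push({ input: c.input, output });
    }
  }
  return failed; // empty array → no regressions
}
```

For probabilistic failures, run each case several times and fail the suite if any run misses.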

Weekly Review:

const weeklyAnalysis = {
  totalFailures: 23,
  byCategory: {
    tool: 5,
    toolSelection: 3,
    context: 10, // ← Most common
    reasoning: 3,
    instruction: 2,
  },
  rootCauses: [
    "Missing renewal date context (7 failures)",
    "Ambiguous 'status' field (4 failures)",
  ],
  fixesApplied: [
    "Added renewal dates to customer template",
    "Renamed 'status' to 'lifecycle_status' with description",
  ],
};

The Debugging Mindset

Agent failures feel mysterious because the system is non-deterministic. But they have causes, and those causes are discoverable.

The key principles:

  1. Reproduce before fixing. If you can't reproduce, you can't verify a fix.
  2. Isolate variables. Change one thing at a time.
  3. Trust the evidence. Logs don't lie; intuition might.
  4. Fix forward. Every fix should prevent future similar failures.

Debugging agents is a skill. It gets easier with practice.

Debug with Confidence

Xtended's structured data and consistent APIs make debugging easier. When your knowledge base is organized, you can quickly identify what context was available versus what was needed.
