Debugging Agent Failures: A Systematic Approach
Agent failures aren't like code bugs. They're contextual, probabilistic, and often subtle. Here's how to diagnose them.
The Debugging Challenge
When code fails, you get a stack trace. When an agent fails, you get... a plausible-sounding wrong answer.
Traditional debugging:

Error → Stack trace → Line number → Fix

Agent debugging:

Wrong output → ??? → Many possible causes → Trial and error

You need a systematic approach. Here it is.
The Agent Failure Taxonomy
Category 1: Tool Failures
The agent called the right tool, but the tool failed.
Symptoms:
- Error messages in output
- Incomplete responses
- "I couldn't access..."
Debugging: Check tool logs, API responses, permissions.
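One way to make those logs exist in the first place: wrap every tool invocation so its input, output or error, and duration are recorded. A minimal sketch — `callToolWithLog` and the `tool.run` shape are assumptions, not a specific framework's API:

```javascript
// Hypothetical wrapper: record every tool invocation so failures leave
// evidence (name, input, output or error, duration), then rethrow.
async function callToolWithLog(tool, input, log) {
  const start = Date.now();
  try {
    const output = await tool.run(input);
    log.push({ tool: tool.name, input, output, durationMs: Date.now() - start });
    return output;
  } catch (err) {
    log.push({ tool: tool.name, input, error: String(err), durationMs: Date.now() - start });
    throw err; // let the agent's error handling see the failure too
  }
}
```

With this in place, "check tool logs" becomes a single array scan rather than archaeology.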
Category 2: Tool Selection Failures
The agent called the wrong tool.
Symptoms:
- Response addresses different question
- Missing expected data
- "I found information about X" (when you asked about Y)
Debugging: Review tool descriptions, test tool selection in isolation.
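"Test tool selection in isolation" can be as simple as running a list of queries through whatever selection step you have and comparing picks against expectations. A sketch, assuming you can expose your selection logic as a `selectTool(query, tools)` function:

```javascript
// Minimal harness: check tool selection against expected picks,
// isolated from the rest of the agent pipeline.
function testToolSelection(selectTool, tools, cases) {
  const failures = [];
  for (const { query, expected } of cases) {
    const picked = selectTool(query, tools);
    if (picked !== expected) failures.push({ query, expected, picked });
  }
  return { total: cases.length, failed: failures.length, failures };
}
```

The `failures` list tells you exactly which phrasings confuse the selector — usually the queries whose wording doesn't match any tool description.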
Category 3: Context Failures
The agent lacked necessary information.
Symptoms:
- Hallucinated facts
- Generic instead of specific answers
- "Based on general knowledge..."
Debugging: Check what context was available, verify retrieval.
Category 4: Reasoning Failures
The agent had the right information but drew wrong conclusions.
Symptoms:
- Logically inconsistent response
- Missed obvious connections
- Contradicts provided context
Debugging: Examine the reasoning chain, check for prompt issues.
Category 5: Instruction Failures
The agent didn't follow instructions correctly.
Symptoms:
- Wrong format
- Missing required elements
- Ignored constraints
Debugging: Review system prompt, test instruction compliance.
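Instruction compliance can often be tested mechanically, separate from answer quality. A sketch for the hypothetical case where the system prompt demands JSON with specific keys:

```javascript
// Validate output against the format the instructions demand
// (here: parseable JSON containing required keys).
function checkCompliance(output, requiredKeys) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return { compliant: false, reason: "not valid JSON" };
  }
  const missing = requiredKeys.filter((k) => !(k in parsed));
  return missing.length
    ? { compliant: false, reason: `missing keys: ${missing.join(", ")}` }
    : { compliant: true };
}
```

Running this over a batch of outputs separates "wrong format" failures from "wrong answer" failures before you start debugging.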
The Debugging Protocol
Step 1: Reproduce the Failure
Document exactly what happened:
```javascript
const failureRecord = {
  timestamp: "2024-01-15T10:30:00Z",
  input: "What's our biggest customer's renewal date?",
  expectedOutput: "Acme Corp renews on March 15, 2024",
  actualOutput: "Your biggest customer is Beta Inc with $50k ARR",
  context: { /* snapshot of available context */ },
  model: "claude-sonnet-4-20250514",
  systemPrompt: "...",
};
```

Can you reproduce it? Agent failures may be probabilistic—run the same input multiple times.
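Because the failure may only show up some of the time, measure a rate rather than a single run. A sketch — `runAgent` and `isCorrect` stand in for your own agent call and correctness check:

```javascript
// Run the same input several times and report the failure rate.
async function measureFailureRate(runAgent, isCorrect, input, runs = 10) {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    const output = await runAgent(input);
    if (!isCorrect(output)) failures++;
  }
  return failures / runs;
}
```

A 2/10 failure rate and a 10/10 failure rate point to very different causes: the former suggests borderline context or reasoning, the latter something structural.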
Step 2: Isolate the Category
Ask diagnostic questions:
□ Did the agent call any tools?
  → No tools called: Reasoning or instruction failure
  → Wrong tools called: Tool selection failure
  → Right tools, wrong results: Tool or context failure

□ Did the tool return correct data?
  → No: Tool failure
  → Yes but agent misused it: Reasoning failure

□ Was the correct context available?
  → No: Context failure
  → Yes but agent ignored it: Reasoning failure

□ Did the output follow instructions?
  → No: Instruction failure
  → Yes but wrong answer: Context or reasoning failure

Step 3: Examine the Evidence
For Tool Failures:

```javascript
// Log tool calls and responses
const toolLog = {
  toolName: "search_customers",
  input: { query: "biggest customer" },
  output: { error: "Connection timeout" },
  duration: 30000,
};
```

For Context Failures:

```javascript
// Compare available vs needed context
const contextAudit = {
  needed: ["customer list with ARR", "renewal dates"],
  available: ["customer list with ARR"], // Missing renewal dates!
  retrieved: ["customer list with ARR"],
};
```

For Reasoning Failures:

```javascript
// Request chain-of-thought
const debugPrompt = `
${originalPrompt}
Think step by step and explain your reasoning before answering.
`;
```

Step 4: Test Your Hypothesis
Once you think you know the cause, verify:

```javascript
// Hypothesis: Agent didn't have renewal date context
// Test: Add renewal dates and rerun
const testContext = {
  ...originalContext,
  renewalDates: { "Acme Corp": "2024-03-15" },
};
const result = await runAgent(originalInput, testContext);
// If correct now, hypothesis confirmed
```

Step 5: Apply the Fix
| Category | Common Fixes |
|---|---|
| Tool failure | Retry logic, timeout increase, error handling |
| Tool selection | Improve tool descriptions, add examples |
| Context failure | Add missing data, improve retrieval |
| Reasoning failure | Refine prompt, add constraints, chain-of-thought |
| Instruction failure | Clarify instructions, add format examples |
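For the first row of the table, retry with exponential backoff is the usual shape. A minimal sketch — delays and attempt counts are illustrative defaults, not recommendations:

```javascript
// Retry a failing async call with exponential backoff:
// waits baseDelayMs, then 2x, then 4x, ... between attempts.
async function withRetry(fn, { attempts = 3, baseDelayMs = 100 } = {}) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastErr; // all attempts exhausted
}
```

Wrap flaky tool calls in this, and a transient timeout stops masquerading as an agent failure.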
Debugging Tools
The Verbose Logger
```javascript
class AgentDebugger {
  constructor() {
    this.logs = [];
  }

  log(event, data) {
    this.logs.push({
      timestamp: Date.now(),
      event,
      data,
    });
  }

  async runWithLogging(agent, input) {
    this.log("input", input);

    const context = await agent.getContext(input);
    this.log("context_retrieved", context);

    const toolCalls = [];
    agent.onToolCall = (name, params, result) => {
      toolCalls.push({ name, params, result });
      this.log("tool_call", { name, params, result });
    };

    const output = await agent.run(input);
    this.log("output", output);

    return {
      output,
      logs: this.logs,
      toolCalls,
    };
  }

  analyze() {
    return {
      totalToolCalls: this.logs.filter((l) => l.event === "tool_call").length,
      contextSize: JSON.stringify(
        this.logs.find((l) => l.event === "context_retrieved")?.data
      ).length,
      timeline: this.logs.map((l) => ({
        event: l.event,
        time: l.timestamp - this.logs[0].timestamp,
      })),
    };
  }
}
```

The Comparison Test
```javascript
async function compareResponses(input, variations) {
  const results = [];

  for (const variation of variations) {
    const result = await runAgent(input, variation.context, variation.prompt);
    results.push({
      variation: variation.name,
      output: result,
      correct: evaluateCorrectness(result),
    });
  }

  return results;
}

// Usage
const comparison = await compareResponses(
  "What's our biggest customer's renewal date?",
  [
    { name: "baseline", context: originalContext, prompt: originalPrompt },
    { name: "with_renewals", context: contextWithRenewals, prompt: originalPrompt },
    { name: "explicit_instructions", context: originalContext, prompt: explicitPrompt },
  ]
);
```

Common Failure Patterns
Pattern: The Confident Hallucination
Symptom: Agent states incorrect facts with certainty.
Cause: Usually context failure—agent lacks information but doesn't admit it.
Fix: Add explicit "if you don't know, say so" instruction. Require source citation.
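One way to apply the fix mechanically: append the honesty instruction to every prompt rather than trusting each author to remember it. A sketch; the wording is illustrative:

```javascript
// Append an explicit uncertainty-and-citation instruction so missing
// context surfaces as "I don't know" instead of a confident guess.
function withHonestyGuard(prompt) {
  return `${prompt}

If the needed information is not in the provided context, say "I don't know"
rather than guessing. Cite the source for every factual claim.`;
}
```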
Pattern: The Lazy Tool Skip
Symptom: Agent answers from general knowledge instead of querying tools.
Cause: Tool description doesn't clearly indicate when to use it.
Fix: Add explicit triggers: "Always use search_customers when asked about specific customers."
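In practice that trigger lives in the tool definition itself. An illustrative shape — the object layout is a generic sketch, not any particular framework's tool schema:

```javascript
// Tool definition whose description states both what it does and
// when the agent must use it, so "answer from memory" stops competing.
const searchCustomersTool = {
  name: "search_customers",
  description:
    "Look up customer records (ARR, renewal dates, contacts). " +
    "Always use this tool when asked about a specific customer; " +
    "never answer customer questions from general knowledge.",
  parameters: {
    query: { type: "string", description: "Customer name or filter" },
  },
};
```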
Pattern: The Format Drift
Symptom: Output format slowly degrades over time.
Cause: Long conversations lose instruction context.
Fix: Reinforce format requirements in each query, not just system prompt.
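Mechanically, that means attaching the format requirement to each query instead of relying on the system prompt alone. A minimal sketch with a hypothetical table format:

```javascript
// Re-attach the format requirement to every user query so long
// conversations can't drift away from the system prompt's instruction.
const FORMAT_REMINDER =
  "Respond as a markdown table with columns: Customer, ARR, Renewal Date.";

function withFormatReminder(query) {
  return `${query}\n\n${FORMAT_REMINDER}`;
}
```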
Pattern: The Wrong Field
Symptom: Agent writes to or reads from wrong database field.
Cause: Ambiguous field names or missing descriptions.
Fix: Add detailed field descriptions: "mrr = Monthly Recurring Revenue in USD"
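An illustrative field dictionary, kept alongside the schema and given to the agent as context (the field set here is hypothetical):

```javascript
// Unambiguous names plus units and allowed values prevent
// wrong-field reads and writes.
const fieldDescriptions = {
  mrr: "Monthly Recurring Revenue in USD (numeric; excludes one-time fees)",
  arr: "Annual Recurring Revenue in USD (numeric; mrr * 12)",
  lifecycle_status: "One of: lead, trial, active, churned",
};
```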
Pattern: The Infinite Loop
Symptom: Agent keeps calling tools without making progress.
Cause: The tool returns unhelpful results and the agent doesn't know how to proceed.
Fix: Add loop detection, timeout, and explicit "if stuck, say so" instruction.
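Loop detection can be a small stateful check on the tool-call stream. A sketch — the threshold and key scheme are assumptions:

```javascript
// Flag when the same tool is called with identical arguments more than
// maxRepeats times: a strong signal the agent is stuck.
function makeLoopDetector(maxRepeats = 2) {
  const counts = new Map();
  return (name, params) => {
    const key = `${name}:${JSON.stringify(params)}`;
    const n = (counts.get(key) || 0) + 1;
    counts.set(key, n);
    return n > maxRepeats; // true → abort the run and report "stuck"
  };
}
```

Call the returned function on every tool call; when it returns true, stop the loop and surface the stall instead of burning tokens.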
Building a Debugging Habit
For Every Failure:
- Document it. Input, expected, actual, context.
- Categorize it. Which of the 5 categories?
- Isolate it. What single change fixes it?
- Fix it. Apply the minimal fix.
- Prevent it. Add a test for this case.
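The "prevent it" step scales when every documented failure becomes a regression case you can rerun. A sketch — `runAgent` and the per-case `check` are stand-ins for your own harness:

```javascript
// Replay documented failures as regression cases; each case pairs an
// input (and its context) with a correctness check on the output.
async function runRegressionSuite(runAgent, cases) {
  const results = [];
  for (const c of cases) {
    const output = await runAgent(c.input, c.context);
    results.push({ input: c.input, passed: c.check(output) });
  }
  return results;
}
```

Run the suite after every prompt, tool, or context change; a past failure that resurfaces shows up immediately.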
Weekly Review:
```javascript
const weeklyAnalysis = {
  totalFailures: 23,
  byCategory: {
    tool: 5,
    toolSelection: 3,
    context: 10, // ← Most common
    reasoning: 3,
    instruction: 2,
  },
  rootCauses: [
    "Missing renewal date context (7 failures)",
    "Ambiguous 'status' field (4 failures)",
  ],
  fixesApplied: [
    "Added renewal dates to customer template",
    "Renamed 'status' to 'lifecycle_status' with description",
  ],
};
```

The Debugging Mindset
Agent failures feel mysterious because the system is non-deterministic. But they have causes, and those causes are discoverable.
The key principles:
- Reproduce before fixing. If you can't reproduce, you can't verify a fix.
- Isolate variables. Change one thing at a time.
- Trust the evidence. Logs don't lie; intuition might.
- Fix forward. Every fix should prevent future similar failures.
Debugging agents is a skill. It gets easier with practice.
Debug with Confidence
Xtended's structured data and consistent APIs make debugging easier. When your knowledge base is organized, you can quickly identify what context was available versus what was needed.
Get Started Free