
The Agent Testing Pyramid: Unit, Integration, and Comprehension Tests

Traditional testing frameworks weren't built for non-deterministic AI systems. Here's what works.

14 min read

The Testing Challenge

Traditional software:

  • Same input → Same output (deterministic)
  • Clear pass/fail criteria
  • Well-established testing patterns

AI agents:

  • Same input → Similar but variable output (non-deterministic)
  • Quality is a spectrum, not binary
  • Testing patterns are still emerging

You can't apply traditional testing directly. But you can't ship untested agents either.


The Agent Testing Pyramid

[Figure: The Agent Testing Pyramid. Layers from top (slow, expensive, realistic) to base (fast, cheap, deterministic):
  • E2E Tests: few, slow, expensive; complete workflows
  • Comprehension Tests: does the agent understand? New, agent-specific layer
  • Integration Tests: agent-tool interface; tool selection, params
  • Unit Tests: many, fast, cheap; tool functions only
Callout: Comprehension Tests are the agent difference. Does the agent understand your schema, domain, and instructions?]

Four levels, each serving a purpose:

  1. Unit tests: Individual tool functions work correctly
  2. Integration tests: Agent interacts with tools properly
  3. Comprehension tests: Agent understands context and instructions
  4. End-to-end tests: Complete workflows produce acceptable results

Level 1: Unit Tests

What you're testing: The tools your agent uses, independent of the AI.

describe("createEntry tool", () => {
  test("creates entry with valid data", async () => {
    const result = await createEntry({
      template_id: "123",
      metadata: { title: "Test", status: "active" },
    });

    expect(result.success).toBe(true);
    expect(result.entry.id).toBeDefined();
  });

  test("rejects entry with missing required field", async () => {
    const result = await createEntry({
      template_id: "123",
      metadata: { status: "active" }, // missing required 'title'
    });

    expect(result.success).toBe(false);
    expect(result.error).toContain("title");
  });
});

Characteristics:

  • Deterministic (no AI involved)
  • Fast (milliseconds)
  • High coverage
  • Traditional testing patterns apply

What unit tests catch:

  • API contract violations
  • Data validation errors
  • Tool implementation bugs
  • Edge cases in utilities
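
To make the tests above concrete, here is a minimal in-memory sketch of a tool with the result shape they assert on (`success`, `entry`, `error`). The `TEMPLATES` registry, its `requiredFields` list, and the ID scheme are assumptions for illustration, not the real tool's implementation.

```javascript
// Hypothetical template registry; a real tool would load this from its backend.
const TEMPLATES = {
  "123": { requiredFields: ["title", "status"] },
};

let nextId = 1;

// Validates metadata against the template and returns a result object
// instead of throwing, so tests can assert on `success` and `error`.
async function createEntry({ template_id, metadata = {} }) {
  const template = TEMPLATES[template_id];
  if (!template) {
    return { success: false, error: `Unknown template: ${template_id}` };
  }
  const missing = template.requiredFields.filter(
    (field) => metadata[field] === undefined || metadata[field] === ""
  );
  if (missing.length > 0) {
    return {
      success: false,
      error: `Missing required field(s): ${missing.join(", ")}`,
    };
  }
  const entry = { id: String(nextId++), template_id, metadata };
  return { success: true, entry };
}
```

Returning an error object rather than throwing keeps the failure message available to the agent, which can then correct its parameters and retry.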

Level 2: Integration Tests

What you're testing: Agent correctly calls tools given specific scenarios.

describe("agent tool calling", () => {
  test("agent calls search when asked to find information", async () => {
    const toolCalls = [];
    const mockToolHandler = (name, params) => {
      toolCalls.push({ name, params });
      return { results: [{ id: "1", title: "Test Entry" }] };
    };

    await runAgent(
      "Find entries about customer onboarding",
      mockToolHandler
    );

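    // Look up the search call by name; it may not be the first tool call.
    const searchCall = toolCalls.find((c) => c.name === "search_entries");
    expect(searchCall).toBeDefined();
    expect(searchCall.params.query.toLowerCase()).toContain("onboarding");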
  });

  test("agent calls create when asked to add information", async () => {
    const toolCalls = [];
    const mockToolHandler = (name, params) => {
      toolCalls.push({ name, params });
      return { success: true };
    };

    await runAgent(
      "Add a new customer segment called Enterprise",
      mockToolHandler
    );

    expect(toolCalls.some((c) => c.name === "create_entry")).toBe(true);
  });
});

Characteristics:

  • Somewhat non-deterministic (AI chooses tools)
  • Moderate speed (API calls involved)
  • Tests agent-tool interface
  • May need retry logic for flaky assertions

What integration tests catch:

  • Wrong tool selection
  • Incorrect parameter formatting
  • Missing tool calls
  • Tool sequencing errors
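
Sequencing errors can be checked against the same recorded `toolCalls` array. The helper below is a sketch, and `calledInOrder` is a hypothetical name. It treats the expected tools as an ordered subsequence, so extra calls in between do not fail the test.

```javascript
// Returns true when `expectedOrder` appears within the recorded calls as an
// ordered subsequence. Extra calls in between are tolerated, since agents
// often make additional lookups.
function calledInOrder(toolCalls, expectedOrder) {
  let next = 0;
  for (const call of toolCalls) {
    if (call.name === expectedOrder[next]) next += 1;
    if (next === expectedOrder.length) return true;
  }
  return expectedOrder.length === 0;
}
```

In a test this might read `expect(calledInOrder(toolCalls, ["search_entries", "create_entry"])).toBe(true)`, for example to verify that the agent checks for duplicates before creating an entry.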

Level 3: Comprehension Tests

What you're testing: Does the agent understand your schema, your data, your domain?

This is the new layer that traditional testing doesn't address.

describe("agent schema comprehension", () => {
  test("agent understands customer segment template", async () => {
    const response = await runAgent(
      "Describe what the Customer Segments template is for and what each field means."
    );

    // Compare case-insensitively: the agent may write "Customer Segment".
    const text = response.toLowerCase();
    expect(text).toContain("customer segment");
    expect(text).toContain("target market");
    expect(text).toContain("pain points");
    expect(text).not.toContain("i don't have information");
  });

  test("agent correctly interprets status field values", async () => {
    const response = await runAgent(
      "What does it mean when a customer has status 'at_risk'?"
    );

    expect(response.toLowerCase()).toContain("concern");
    expect(response.toLowerCase()).toMatch(/churn|retain|attention/);
  });

  test("agent understands field relationships", async () => {
    const response = await runAgent(
      "How are customer segments related to product features?"
    );

    expect(response.toLowerCase()).toContain("target");
    // `s` lets `.` span line breaks; `i` ignores case.
    expect(response).toMatch(/segment.*feature|feature.*segment/is);
  });
});

Characteristics:

  • Highly non-deterministic
  • Tests understanding, not just function
  • Assertions are semantic, not exact
  • May need fuzzy matching or embeddings

What comprehension tests catch:

  • Poor schema descriptions
  • Ambiguous field names
  • Missing context
  • Misunderstanding of domain concepts
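
Because assertions at this level are semantic rather than exact, one lightweight option is to group acceptable phrasings by concept and report which concepts the response never mentions. This is a sketch only: `missingConcepts` and the groups passed to it are hypothetical, and keyword groups remain a coarse proxy for understanding.

```javascript
// Each group lists acceptable phrasings for one concept; a response covers a
// concept when it contains at least one of them (case-insensitive).
// Returns the names of the concepts that were not covered.
function missingConcepts(response, conceptGroups) {
  const text = response.toLowerCase();
  return Object.entries(conceptGroups)
    .filter(
      ([, phrases]) => !phrases.some((p) => text.includes(p.toLowerCase()))
    )
    .map(([concept]) => concept);
}
```

Asserting that the returned array is empty produces a more useful failure message than a chain of `toContain` calls, because it names the concept the agent missed.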

Level 4: End-to-End Tests

What you're testing: Complete workflows produce acceptable business outcomes.

describe("e2e: customer briefing workflow", () => {
  beforeAll(async () => {
    // Setup realistic data
    await seedTestData({
      customers: 10,
      interactions: 50,
      features: 20,
    });
  });

  test("generates useful customer briefing", async () => {
    const briefing = await runAgent(
      "Give me a briefing on Acme Corp before my call with them tomorrow"
    );

    // Structure checks
    expect(briefing).toContain("Acme Corp");
    expect(briefing.length).toBeGreaterThan(200);
    expect(briefing.length).toBeLessThan(2000);

    // Content checks
    const sections = ["overview", "recent", "issues", "opportunities"];
    const sectionCount = sections.filter((s) =>
      briefing.toLowerCase().includes(s)
    ).length;
    expect(sectionCount).toBeGreaterThanOrEqual(2);

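    // Cheap guard against leftover filler text. String checks like these
    // cannot detect fabricated facts; compare against the seeded data for that.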
    expect(briefing).not.toContain("I made up");
    expect(briefing).not.toContain("[PLACEHOLDER]");
  });

  test("handles missing customer gracefully", async () => {
    const briefing = await runAgent(
      "Brief me on NonExistentCorp"
    );

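    // Accept any phrasing that signals the customer was not found.
    const notFound = /no (information|records?|data)|not found|could(n't| not) find/i;
    expect(briefing).toMatch(notFound);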
    expect(briefing).not.toContain("NonExistentCorp is a");
  });
});

Characteristics:

  • Most expensive (time, tokens, infrastructure)
  • Most realistic
  • Hardest to maintain
  • Most valuable for critical paths
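
Since E2E tests run against seeded data, some checks can be grounded in that data instead of in fixed strings. As one hedged example, the hypothetical helper below flags seeded customers that show up in a briefing requested for someone else. It assumes the test knows the seeded customer names, and a mention is not always an error, so treat a hit as a prompt for review.

```javascript
// Returns the names of seeded customers that appear in the briefing even
// though the briefing was requested for a different customer.
function leakedCustomers(briefing, seededCustomerNames, requestedCustomer) {
  const text = briefing.toLowerCase();
  const requested = requestedCustomer.toLowerCase();
  return seededCustomerNames.filter(
    (name) =>
      name.toLowerCase() !== requested && text.includes(name.toLowerCase())
  );
}
```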

Assertion Strategies for Non-Deterministic Output

Contains Assertion

expect(response).toContain("customer");

Use for: Required elements that must appear

Pattern Matching

expect(response).toMatch(/\d+ customers/);

Use for: Structured data within text
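
A pattern match becomes more useful when the captured value is compared with known data. A small sketch, where `extractCustomerCount` is a hypothetical helper and the phrase "N customers" is assumed to appear in the response:

```javascript
// Pulls the first "<number> customers" count out of free text.
// Returns null when the response contains no such phrase.
// Note: does not handle thousands separators such as "1,000".
function extractCustomerCount(response) {
  const match = response.match(/(\d+)\s+customers\b/i);
  return match ? Number(match[1]) : null;
}
```

With ten seeded customers, the assertion could then be `expect(extractCustomerCount(response)).toBe(10)`.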

Semantic Similarity

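// Assumes embed() is async and returns a numeric vector,
// and cosineSimilarity() is a synchronous helper.
const [actual, expected] = await Promise.all([
  embed(response),
  embed(expectedMeaning),
]);
// 0.8 is an illustrative threshold; calibrate it for your embedding model.
expect(cosineSimilarity(actual, expected)).toBeGreaterThan(0.8);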

Use for: Meaning without exact wording
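
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal synchronous sketch follows; how the vectors are produced is left to whichever embedding provider you use.

```javascript
// Cosine similarity of two equal-length numeric vectors:
// dot(a, b) / (|a| * |b|). Ranges from -1 (opposite) to 1 (same direction).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Similarity is undefined for a zero vector; treat it as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```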

LLM-as-Judge

// Assumes runEvaluator() parses the model's JSON reply into an object.
const evaluation = await runEvaluator(
  `Does this response adequately brief someone on a customer?
   Response: ${response}
   Return JSON: { adequate: boolean, reasoning: string }`
);
expect(evaluation.adequate).toBe(true);

Use for: Complex quality judgments
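
Judge models do not always return clean JSON, so it helps to validate the verdict before asserting on it. The sketch below assumes the evaluator hands back raw text; `parseJudgeVerdict` is a hypothetical helper, and the field names come from the prompt above.

```javascript
// Extracts and validates the judge's verdict from raw model output.
// Models often wrap JSON in prose or code fences, so we take the span from
// the first "{" to the last "}" before parsing.
function parseJudgeVerdict(raw) {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("Judge output contains no JSON object");
  }
  const verdict = JSON.parse(raw.slice(start, end + 1));
  if (typeof verdict.adequate !== "boolean") {
    throw new Error("Judge verdict is missing a boolean 'adequate' field");
  }
  return {
    adequate: verdict.adequate,
    reasoning: String(verdict.reasoning ?? ""),
  };
}
```

Failing loudly on a malformed verdict keeps a broken judge from being mistaken for a failing agent.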

Negative Assertions

expect(response).not.toContain("I don't know");
expect(response.length).toBeLessThan(5000);

Use for: Preventing known failure modes


CI/CD Considerations

Run Unit Tests Always

Every commit. Fast, cheap, deterministic.

Run Integration Tests on PR

Before merge. Moderate cost.

Run Comprehension Tests Daily

Or on significant prompt/schema changes. Catch understanding drift.

Run E2E Tests on Release

Before deploy. Full validation of critical paths.
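
The schedule above can be written down as a simple lookup that a CI script consults. The trigger names are illustrative and not tied to any CI product. Running the cheaper tiers alongside the expensive ones is an assumption, since the schedule only states when each tier starts running.

```javascript
// Maps a CI trigger to the test tiers that should run for it.
// Assumes each trigger also runs the cheaper tiers below it.
const TIERS_BY_TRIGGER = {
  commit: ["unit"],
  pull_request: ["unit", "integration"],
  daily: ["unit", "integration", "comprehension"],
  release: ["unit", "integration", "comprehension", "e2e"],
};

function tiersFor(trigger) {
  if (!Object.prototype.hasOwnProperty.call(TIERS_BY_TRIGGER, trigger)) {
    throw new Error(`Unknown CI trigger: ${trigger}`);
  }
  return TIERS_BY_TRIGGER[trigger];
}
```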

Budget for Flakiness

Agent tests are inherently flakier than deterministic tests. Set retry and timeout policies in your test runner's configuration, for example:

test:
  retry: 2
  timeout: 30s
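
In Jest, `jest.retryTimes(2)` provides retries like those configured above. Retries hide how often a test fails, though, so an alternative is to run a check several times and assert on its pass rate. A sketch, with `passRate` as a hypothetical helper:

```javascript
// Runs an async check `runs` times and reports the fraction that passed.
// A check "passes" when it resolves without throwing.
async function passRate(check, runs = 5) {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    try {
      await check();
      passed += 1;
    } catch {
      // A thrown assertion counts as one failed run.
    }
  }
  return passed / runs;
}
```

A test might then require `expect(await passRate(check, 5)).toBeGreaterThanOrEqual(0.8)`, which makes the tolerated failure rate explicit. Each run costs tokens and time, so keep the run count small.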

Start Here

  1. Get unit test coverage of your tool code to 90%+: Your foundation
  2. Add integration tests for each tool: Verify agent-tool interface
  3. Create comprehension tests for core templates: The differentiator
  4. Build E2E tests for critical workflows: Your safety net

Testing agents is harder than testing traditional software. But untested agents are untrustworthy agents. Build the pyramid.

Test Your Agents

Xtended's consistent API makes integration testing straightforward. Predictable responses help you build reliable agent test suites.
