The Agent Testing Pyramid: Unit, Integration, and Comprehension Tests
Traditional testing frameworks weren't built for non-deterministic AI systems. Here's what works.
The Testing Challenge
Traditional software:
- Same input → Same output (deterministic)
- Clear pass/fail criteria
- Well-established testing patterns
AI agents:
- Same input → Similar but variable output (non-deterministic)
- Quality is a spectrum, not binary
- Testing patterns are still emerging
You can't apply traditional testing directly. But you can't ship untested agents either.
The Agent Testing Pyramid
Four levels, each serving a purpose:
- Unit tests: Individual tool functions work correctly
- Integration tests: Agent interacts with tools properly
- Comprehension tests: Agent understands context and instructions
- End-to-end tests: Complete workflows produce acceptable results
Level 1: Unit Tests
What you're testing: The tools your agent uses, independent of the AI.
describe("createEntry tool", () => {
test("creates entry with valid data", async () => {
const result = await createEntry({
template_id: "123",
metadata: { title: "Test", status: "active" },
});
expect(result.success).toBe(true);
expect(result.entry.id).toBeDefined();
});
test("rejects entry with missing required field", async () => {
const result = await createEntry({
template_id: "123",
metadata: { status: "active" }, // missing required 'title'
});
expect(result.success).toBe(false);
expect(result.error).toContain("title");
});
});
Characteristics:
- Deterministic (no AI involved)
- Fast (milliseconds)
- High coverage
- Traditional testing patterns apply
What unit tests catch:
- API contract violations
- Data validation errors
- Tool implementation bugs
- Edge cases in utilities
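The function under test here is ordinary application code; no model is involved. For context, a minimal sketch of what a createEntry tool might look like (the in-memory store and the required-field list are assumptions for illustration):
const { randomUUID } = require("node:crypto");

const REQUIRED_FIELDS = ["title"]; // assumed schema; yours comes from the template
const entries = new Map(); // stand-in for a real data store

async function createEntry({ template_id, metadata }) {
  // Validate required metadata before touching storage
  for (const field of REQUIRED_FIELDS) {
    if (!(field in metadata)) {
      return { success: false, error: `Missing required field: ${field}` };
    }
  }
  const entry = { id: randomUUID(), template_id, metadata };
  entries.set(entry.id, entry);
  return { success: true, entry };
}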
Level 2: Integration Tests
What you're testing: Agent correctly calls tools given specific scenarios.
describe("agent tool calling", () => {
test("agent calls search when asked to find information", async () => {
const toolCalls = [];
const mockToolHandler = (name, params) => {
toolCalls.push({ name, params });
return { results: [{ id: "1", title: "Test Entry" }] };
};
await runAgent(
"Find entries about customer onboarding",
mockToolHandler
);
const searchCall = toolCalls.find((c) => c.name === "search_entries");
expect(searchCall).toBeDefined();
expect(searchCall.params.query).toContain("onboarding");
});
test("agent calls create when asked to add information", async () => {
const toolCalls = [];
const mockToolHandler = (name, params) => {
toolCalls.push({ name, params });
return { success: true };
};
await runAgent(
"Add a new customer segment called Enterprise",
mockToolHandler
);
expect(toolCalls.some((c) => c.name === "create_entry")).toBe(true);
});
});
Characteristics:
- Somewhat non-deterministic (AI chooses tools)
- Moderate speed (API calls involved)
- Tests agent-tool interface
- May need retry logic for flaky assertions
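That retry logic doesn't have to live in CI config; a small wrapper works too. A sketch (withRetry is our helper, not a Jest built-in; Jest users can instead call jest.retryTimes(2) at the top of a suite):
// Re-run a flaky async assertion a few times before failing for real.
async function withRetry(fn, attempts = 3) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // keep the most recent failure for the report
    }
  }
  throw lastError;
}

// Usage: await withRetry(async () => { /* run agent + assertions */ });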
What integration tests catch:
- Wrong tool selection
- Incorrect parameter formatting
- Missing tool calls
- Tool sequencing errors
Level 3: Comprehension Tests
What you're testing: Does the agent understand your schema, your data, your domain?
This is the new layer that traditional testing doesn't address.
describe("agent schema comprehension", () => {
test("agent understands customer segment template", async () => {
const response = await runAgent(
"Describe what the Customer Segments template is for and what each field means."
);
expect(response).toContain("customer segment");
expect(response).toContain("target market");
expect(response).toContain("pain points");
expect(response).not.toContain("I don't have information");
});
test("agent correctly interprets status field values", async () => {
const response = await runAgent(
"What does it mean when a customer has status 'at_risk'?"
);
expect(response.toLowerCase()).toContain("concern");
expect(response.toLowerCase()).toMatch(/churn|retain|attention/);
});
test("agent understands field relationships", async () => {
const response = await runAgent(
"How are customer segments related to product features?"
);
expect(response).toContain("target");
expect(response).toMatch(/segment.*feature|feature.*segment/);
});
});
Characteristics:
- Highly non-deterministic
- Tests understanding, not just function
- Assertions are semantic, not exact
- May need fuzzy matching or embeddings
What comprehension tests catch:
- Poor schema descriptions
- Ambiguous field names
- Missing context
- Misunderstanding of domain concepts
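These tests lean on a runAgent(prompt) harness that injects your schema into the model call. A minimal sketch of the no-tools variant, assuming the OpenAI Node SDK and a SCHEMA_DESCRIPTION string you maintain (the model choice and prompt wiring are illustrative):
const OpenAI = require("openai");
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Assumed: a plain-text dump of your templates, fields, and allowed values.
const SCHEMA_DESCRIPTION = require("./schema-description");

async function runAgent(prompt) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini", // illustrative; test against the model you ship with
    messages: [
      { role: "system", content: `You are a data assistant.\n\n${SCHEMA_DESCRIPTION}` },
      { role: "user", content: prompt },
    ],
  });
  return completion.choices[0].message.content;
}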
Level 4: End-to-End Tests
What you're testing: Complete workflows produce acceptable business outcomes.
describe("e2e: customer briefing workflow", () => {
beforeAll(async () => {
// Setup realistic data
await seedTestData({
customers: 10,
interactions: 50,
features: 20,
});
});
test("generates useful customer briefing", async () => {
const briefing = await runAgent(
"Give me a briefing on Acme Corp before my call with them tomorrow"
);
// Structure checks
expect(briefing).toContain("Acme Corp");
expect(briefing.length).toBeGreaterThan(200);
expect(briefing.length).toBeLessThan(2000);
// Content checks
const sections = ["overview", "recent", "issues", "opportunities"];
const sectionCount = sections.filter((s) =>
briefing.toLowerCase().includes(s)
).length;
expect(sectionCount).toBeGreaterThanOrEqual(2);
// No hallucination check
expect(briefing).not.toContain("I made up");
expect(briefing).not.toContain("[PLACEHOLDER]");
});
test("handles missing customer gracefully", async () => {
const briefing = await runAgent(
"Brief me on NonExistentCorp"
);
expect(briefing).toContain("no information");
expect(briefing).not.toContain("NonExistentCorp is a");
});
});
Characteristics:
- Most expensive (time, tokens, infrastructure)
- Most realistic
- Hardest to maintain
- Most valuable for critical paths
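The seedTestData helper in the beforeAll above is yours to write; a sketch that reuses the createEntry tool from Level 1 (the entity shapes are assumptions):
// Populate the test environment with known, plausible records.
async function seedTestData({ customers, interactions, features }) {
  for (let i = 0; i < customers; i++) {
    await createEntry({
      template_id: "customer",
      metadata: {
        title: i === 0 ? "Acme Corp" : `Customer ${i}`, // guarantee the briefing target exists
        status: "active",
      },
    });
  }
  // interactions and features seed the same way with their own templates
}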
Assertion Strategies for Non-Deterministic Output
Contains Assertion
expect(response).toContain("customer");
Use for: Required elements that must appear
Pattern Matching
expect(response).toMatch(/\d+ customers/);
Use for: Structured data within text
Semantic Similarity
const similarity = cosineSimilarity(
  await embed(response),
  await embed(expectedMeaning)
);
expect(similarity).toBeGreaterThan(0.8);
Use for: Meaning without exact wording
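Neither helper is a test-framework built-in. A sketch, assuming the OpenAI embeddings endpoint (the model name is illustrative):
const OpenAI = require("openai");
const client = new OpenAI();

// Turn text into an embedding vector.
async function embed(text) {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small", // illustrative
    input: text,
  });
  return res.data[0].embedding;
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}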
LLM-as-Judge
const evaluation = await runEvaluator(
`Does this response adequately brief someone on a customer?
Response: ${response}
Return JSON: { adequate: boolean, reasoning: string }`
);
expect(evaluation.adequate).toBe(true);
Use for: Complex quality judgments
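runEvaluator is also yours to implement: a second model call that returns parsed JSON. A sketch reusing the client from the earlier harness (JSON mode and the model choice are assumptions):
// Ask a model to judge output quality and return structured JSON.
async function runEvaluator(prompt) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini", // a different (often cheaper) model than the agent's is common
    response_format: { type: "json_object" }, // force parseable output
    messages: [{ role: "user", content: prompt }],
  });
  return JSON.parse(completion.choices[0].message.content);
}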
Negative Assertions
expect(response).not.toContain("I don't know");
expect(response.length).toBeLessThan(5000);
Use for: Preventing known failure modes
CI/CD Considerations
Run Unit Tests Always
Every commit. Fast, cheap, deterministic.
Run Integration Tests on PR
Before merge. Moderate cost.
Run Comprehension Tests Daily
Or on significant prompt/schema changes. Catch understanding drift.
Run E2E Tests on Release
Before deploy. Full validation of critical paths.
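In GitHub Actions terms, the first three tiers of that schedule might look like this (job names and npm scripts are assumptions; E2E would hang off your release workflow):
on:
  push:           # unit tests on every commit
  pull_request:   # integration tests before merge
  schedule:
    - cron: "0 6 * * *"  # comprehension tests daily

jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:unit
  integration:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:integration
  comprehension:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:comprehension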
Budget for Flakiness
Agent tests are inherently flakier. Set retry policies:
test:
  retry: 2
  timeout: 30s
Start Here
- Get unit tests to 90%+ — Your foundation
- Add integration tests for each tool — Verify agent-tool interface
- Create comprehension tests for core templates — The differentiator
- Build E2E tests for critical workflows — Your safety net
Testing agents is harder than testing traditional software. But untested agents are untrustworthy agents. Build the pyramid.
Test Your Agents
Xtended's consistent API makes integration testing straightforward. Predictable responses help you build reliable agent test suites.