The Agent Testing Pyramid: Unit, Integration, and Comprehension Tests
Traditional testing frameworks weren't built for non-deterministic AI systems. Here's what works.
The Testing Challenge
Traditional software:
- Same input → Same output (deterministic)
- Clear pass/fail criteria
- Well-established testing patterns
AI agents:
- Same input → Similar but variable output (non-deterministic)
- Quality is a spectrum, not binary
- Testing patterns are still emerging
You can't apply traditional testing directly. But you can't ship untested agents either.
The Agent Testing Pyramid
Four levels, each serving a purpose:
- Unit tests: Individual tool functions work correctly
- Integration tests: Agent interacts with tools properly
- Comprehension tests: Agent understands context and instructions
- End-to-end tests: Complete workflows produce acceptable results
Level 1: Unit Tests
What you're testing: The tools your agent uses, independent of the AI.
describe("createEntry tool", () => {
test("creates entry with valid data", async () => {
const result = await createEntry({
template_id: "123",
metadata: { title: "Test", status: "active" },
});
expect(result.success).toBe(true);
expect(result.entry.id).toBeDefined();
});
test("rejects entry with missing required field", async () => {
const result = await createEntry({
template_id: "123",
metadata: { status: "active" }, // missing required 'title'
});
expect(result.success).toBe(false);
expect(result.error).toContain("title");
});
});
Characteristics:
- Deterministic (no AI involved)
- Fast (milliseconds)
- High coverage
- Traditional testing patterns apply
What unit tests catch:
- API contract violations
- Data validation errors
- Tool implementation bugs
- Edge cases in utilities
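The function under test here is ordinary application code; no model is involved. For context, a minimal sketch of what a createEntry tool might look like (the in-memory store and the required-field list are assumptions for illustration):
const { randomUUID } = require("node:crypto");

const REQUIRED_FIELDS = ["title"]; // assumed schema; yours comes from the template
const entries = new Map(); // stand-in for a real data store

async function createEntry({ template_id, metadata }) {
  // Validate required metadata before touching storage
  for (const field of REQUIRED_FIELDS) {
    if (!(field in metadata)) {
      return { success: false, error: `Missing required field: ${field}` };
    }
  }
  const entry = { id: randomUUID(), template_id, metadata };
  entries.set(entry.id, entry);
  return { success: true, entry };
}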
Level 2: Integration Tests
What you're testing: Agent correctly calls tools given specific scenarios.
describe("agent tool calling", () => {
test("agent calls search when asked to find information", async () => {
const toolCalls = [];
const mockToolHandler = (name, params) => {
toolCalls.push({ name, params });
return { results: [{ id: "1", title: "Test Entry" }] };
};
await runAgent(
"Find entries about customer onboarding",
mockToolHandler
);
const searchCall = toolCalls.find((c) => c.name === "search_entries");
expect(searchCall).toBeDefined();
expect(searchCall.params.query).toContain("onboarding");
});
test("agent calls create when asked to add information", async () => {
const toolCalls = [];
const mockToolHandler = (name, params) => {
toolCalls.push({ name, params });
return { success: true };
};
await runAgent(
"Add a new customer segment called Enterprise",
mockToolHandler
);
expect(toolCalls.some((c) => c.name === "create_entry")).toBe(true);
});
});
Characteristics:
- Somewhat non-deterministic (AI chooses tools)
- Moderate speed (API calls involved)
- Tests agent-tool interface
- May need retry logic for flaky assertions
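That retry logic doesn't have to live in CI config; a small wrapper works too. A sketch (withRetry is our helper, not a Jest built-in; Jest users can instead call jest.retryTimes(2) at the top of a suite):
// Re-run a flaky async assertion a few times before failing for real.
async function withRetry(fn, attempts = 3) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // keep the most recent failure for the report
    }
  }
  throw lastError;
}

// Usage: await withRetry(async () => { /* run agent + assertions */ });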
What integration tests catch:
- Wrong tool selection
- Incorrect parameter formatting
- Missing tool calls
- Tool sequencing errors
Level 3: Comprehension Tests
What you're testing: Does the agent understand your schema, your data, your domain?
This is the new layer that traditional testing doesn't address.
describe("agent schema comprehension", () => {
test("agent understands customer segment template", async () => {
const response = await runAgent(
"Describe what the Customer Segments template is for and what each field means."
);
expect(response).toContain("customer segment");
expect(response).toContain("target market");
expect(response).toContain("pain points");
expect(response).not.toContain("I don't have information");
});
test("agent correctly interprets status field values", async () => {
const response = await runAgent(
"What does it mean when a customer has status 'at_risk'?"
);
expect(response.toLowerCase()).toContain("concern");
expect(response.toLowerCase()).toMatch(/churn|retain|attention/);
});
test("agent understands field relationships", async () => {
const response = await runAgent(
"How are customer segments related to product features?"
);
expect(response).toContain("target");
expect(response).toMatch(/segment.*feature|feature.*segment/);
});
});
Characteristics:
- Highly non-deterministic
- Tests understanding, not just function
- Assertions are semantic, not exact
- May need fuzzy matching or embeddings
What comprehension tests catch:
- Poor schema descriptions
- Ambiguous field names
- Missing context
- Misunderstanding of domain concepts
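These tests lean on a runAgent(prompt) harness that injects your schema into the model call. A minimal sketch of the no-tools variant, assuming the OpenAI Node SDK and a SCHEMA_DESCRIPTION string you maintain (the model choice and prompt wiring are illustrative):
const OpenAI = require("openai");
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Assumed: a plain-text dump of your templates, fields, and allowed values.
const SCHEMA_DESCRIPTION = require("./schema-description");

async function runAgent(prompt) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini", // illustrative; test against the model you ship with
    messages: [
      { role: "system", content: `You are a data assistant.\n\n${SCHEMA_DESCRIPTION}` },
      { role: "user", content: prompt },
    ],
  });
  return completion.choices[0].message.content;
}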
Level 4: End-to-End Tests
What you're testing: Complete workflows produce acceptable business outcomes.
describe("e2e: customer briefing workflow", () => {
beforeAll(async () => {
// Setup realistic data
await seedTestData({
customers: 10,
interactions: 50,
features: 20,
});
});
test("generates useful customer briefing", async () => {
const briefing = await runAgent(
"Give me a briefing on Acme Corp before my call with them tomorrow"
);
// Structure checks
expect(briefing).toContain("Acme Corp");
expect(briefing.length).toBeGreaterThan(200);
expect(briefing.length).toBeLessThan(2000);
// Content checks
const sections = ["overview", "recent", "issues", "opportunities"];
const sectionCount = sections.filter((s) =>
briefing.toLowerCase().includes(s)
).length;
expect(sectionCount).toBeGreaterThanOrEqual(2);
// No hallucination check
expect(briefing).not.toContain("I made up");
expect(briefing).not.toContain("[PLACEHOLDER]");
});
test("handles missing customer gracefully", async () => {
const briefing = await runAgent(
"Brief me on NonExistentCorp"
);
expect(briefing).toContain("no information");
expect(briefing).not.toContain("NonExistentCorp is a");
});
});
Characteristics:
- Most expensive (time, tokens, infrastructure)
- Most realistic
- Hardest to maintain
- Most valuable for critical paths
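The seedTestData helper in the beforeAll above is yours to write; a sketch that reuses the createEntry tool from Level 1 (the entity shapes are assumptions):
// Populate the test environment with known, plausible records.
async function seedTestData({ customers, interactions, features }) {
  for (let i = 0; i < customers; i++) {
    await createEntry({
      template_id: "customer",
      metadata: {
        title: i === 0 ? "Acme Corp" : `Customer ${i}`, // guarantee the briefing target exists
        status: "active",
      },
    });
  }
  // interactions and features seed the same way with their own templates
}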
Assertion Strategies for Non-Deterministic Output
Contains Assertion
expect(response).toContain("customer");
Use for: Required elements that must appear
Pattern Matching
expect(response).toMatch(/\d+ customers/);
Use for: Structured data within text
Semantic Similarity
const similarity = cosineSimilarity(
  await embed(response),
  await embed(expectedMeaning)
);
expect(similarity).toBeGreaterThan(0.8);
Use for: Meaning without exact wording
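Neither helper is a test-framework built-in. A sketch, assuming the OpenAI embeddings endpoint (the model name is illustrative):
const OpenAI = require("openai");
const client = new OpenAI();

// Turn text into an embedding vector.
async function embed(text) {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small", // illustrative
    input: text,
  });
  return res.data[0].embedding;
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}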
LLM-as-Judge
const evaluation = await runEvaluator(
`Does this response adequately brief someone on a customer?
Response: ${response}
Return JSON: { adequate: boolean, reasoning: string }`
);
expect(evaluation.adequate).toBe(true);
Use for: Complex quality judgments
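runEvaluator is also yours to implement: a second model call that returns parsed JSON. A sketch reusing the client from the earlier harness (JSON mode and the model choice are assumptions):
// Ask a model to judge output quality and return structured JSON.
async function runEvaluator(prompt) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini", // a different (often cheaper) model than the agent's is common
    response_format: { type: "json_object" }, // force parseable output
    messages: [{ role: "user", content: prompt }],
  });
  return JSON.parse(completion.choices[0].message.content);
}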
Negative Assertions
expect(response).not.toContain("I don't know");
expect(response.length).toBeLessThan(5000);
Use for: Preventing known failure modes
CI/CD Considerations
Run Unit Tests Always
Every commit. Fast, cheap, deterministic.
Run Integration Tests on PR
Before merge. Moderate cost.
Run Comprehension Tests Daily
Or on significant prompt/schema changes. Catch understanding drift.
Run E2E Tests on Release
Before deploy. Full validation of critical paths.
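In GitHub Actions terms, the first three tiers of that schedule might look like this (job names and npm scripts are assumptions; E2E would hang off your release workflow):
on:
  push:           # unit tests on every commit
  pull_request:   # integration tests before merge
  schedule:
    - cron: "0 6 * * *"  # comprehension tests daily

jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:unit
  integration:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:integration
  comprehension:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:comprehension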
Budget for Flakiness
Agent tests are inherently flakier. Set retry policies:
test:
  retry: 2
  timeout: 30s
Start Here
- Get unit tests to 90%+ — Your foundation
- Add integration tests for each tool — Verify agent-tool interface
- Create comprehension tests for core templates — The differentiator
- Build E2E tests for critical workflows — Your safety net
Testing agents is harder than testing traditional software. But untested agents are untrustworthy agents. Build the pyramid.
Test Your Agents
Xtended's consistent API makes integration testing straightforward. Predictable responses help you build reliable agent test suites.