Building an Agent Eval Suite in Practice

Category: Agent Building | ~2800 words


The customer support agent had a 98% pass rate on the eval suite. Every day, for three weeks after the model update, it silently routed a specific class of refund requests — those involving partial shipments — to a queue no one monitored. The evals never caught it because the evals never tested it. The suite had 200 tests, all of them variations of the happy path the agent was originally prompted against.

This is not a story about eval philosophy. It is not about Goodhart’s Law or metric gaming. It is about a specific, practical failure: the eval suite did not represent the input distribution the agent encountered in production. When the model update subtly changed how the agent classified compound requests, no test existed to catch it.

This post is about how to build a suite that would have caught it — and that catches the classes of failures that actually kill agents in production.


Why Standard Software Testing Breaks for Agents

Unit tests assume deterministic functions. Given input X, expect output Y. The test either passes or fails, reproducibly. You can write assertions, run them in a loop, and get a green checkmark. This model breaks down in three distinct ways when you apply it to agents.

The non-determinism problem. The same input to an LLM-based agent will not reliably produce the same output twice. Temperature, sampling, and model internals mean that even “correct” agents will sometimes produce subtly different responses. A naive assertion that checks for an exact string match will produce flaky tests that erode trust in the suite. Teams respond by either widening their assertions until they catch nothing, or running a single sample per test case and missing failures that occur 20% of the time.
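
The arithmetic behind that last failure mode is worth making explicit. Assuming independent runs, the chance that k samples catch a failure occurring with probability p is 1 - (1 - p)^k:

```python
# Probability that k independent samples catch a failure that occurs
# with probability p on any single run.
def detection_probability(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

# A failure that shows up on 20% of runs:
for k in (1, 3, 5, 10):
    print(f"{k} sample(s): {detection_probability(0.20, k):.0%} chance of detection")
```

One sample catches a 20% failure only 20% of the time, and even five samples leave roughly a one-in-three chance of missing it entirely.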

The output space problem. An agent’s output is not a return value. It is often a long string of natural language, a sequence of tool calls, or both. The space of valid outputs is enormous. A test that verifies “the agent responded” is nearly useless. A test that verifies “the agent responded with this exact sentence” is brittle and wrong. The interesting region — outputs that are valid but wrong — is hard to express with standard assertions.

The compound action problem. Real agents do not produce single outputs. They reason, call tools, receive results, reason again, and produce a final output. A failure can occur at any step: the initial classification might be correct but a tool call might be malformed, or the tool call might succeed but the agent might misinterpret the result. Standard testing has no primitives for asserting properties across a multi-step trace.

These three problems do not mean you cannot test agents — they mean you need a different taxonomy of tests.


Four Eval Categories That Actually Matter

1. Deterministic Checks

These are tests where the agent either complies or it does not. They do not require semantic understanding. They do not require an LLM judge. They run fast and they should block deployment if they fail.

Examples:

  - The output parses as valid JSON and contains the required fields.
  - Every tool call names a tool in the approved set, with arguments that match its schema.
  - No PII passed in context appears verbatim in the output.
  - Stated amounts and actions stay within the agent's defined operating limits.

These checks are binary. Write them as strict assertions. Run them on every test case in your corpus, on every run. If any of them fail, the agent does not ship.

The key insight is that these tests require zero LLM calls to evaluate. They are pure functions over the agent’s output. That makes them fast, cheap, and reliable enough to run in CI on every commit.

2. Property-Based Tests

Property-based tests assert invariants that should hold across a wide range of inputs, without specifying what the exact output should be. The terminology comes from software testing (see: Hypothesis, QuickCheck), but the pattern is directly applicable to agent evaluation.

For an agent, properties might look like:

  - The response is in the same language as the user's message.
  - Every source cited in the answer appears in the set of documents the agent actually retrieved.
  - The response meaningfully engages with the task rather than deflecting or returning empty content.
  - Every tool call's arguments validate against that tool's declared schema.

Property tests require more sophistication to evaluate than deterministic checks, but many of them can still be partially automated with heuristics: language detection, embedding similarity thresholds, schema validation against tool call logs. Use heuristics first; reach for LLM-as-judge only when heuristics cannot express the property.
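
As a concrete example of a heuristic-evaluated property, here is a sketch of a regex check that no quoted refund amount exceeds an approved limit. The limit and the currency format are assumptions for illustration:

```python
import re

REFUND_LIMIT = 500.00  # hypothetical operating limit

def amounts_within_limit(output: str, limit: float = REFUND_LIMIT) -> bool:
    """Property: the agent never quotes a refund amount above its limit."""
    # Matches dollar amounts like "$75" or "$1,234.56"
    amounts = [
        float(m.replace(",", ""))
        for m in re.findall(r"\$\s*(\d[\d,]*(?:\.\d{2})?)", output)
    ]
    return all(a <= limit for a in amounts)
```

A check like this is crude, but it runs in microseconds and needs no LLM call, which is exactly the point.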

3. LLM-as-Judge Tests

LLM-as-judge evaluation — using a second language model to evaluate the output of your agent — has become standard practice for quality dimensions that resist formal specification. Research has validated that strong LLM judges achieve high agreement with human preferences on many evaluation tasks, though the technique has meaningful failure modes (arXiv:2306.05685).

Use LLM-as-judge for:

  - Factual accuracy of the response, scored against a rubric or reference answer.
  - Helpfulness and completeness from the user's perspective.
  - Tone and adherence to style or policy guidelines.

Three rules for LLM-as-judge that practitioners often learn the hard way:

Use a stronger or different model as judge. If your agent runs on GPT-4o, do not use GPT-4o to judge it. The judge will share the same failure modes. Use a model from a different family, or a larger model with a more explicit evaluation rubric.

Require structured output with rubric scores, not open-ended verdicts. A judge that returns “this answer is good” is useless. A judge that returns a JSON object with scores on each dimension (accuracy: 4/5, helpfulness: 3/5, tone: 5/5) with explicit reasoning is auditable and comparable across runs.
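
A sketch of what validating that structure might look like, with the dimension names and the 1-5 scale as assumptions:

```python
import json

REQUIRED_DIMENSIONS = ("accuracy", "helpfulness", "tone")  # illustrative rubric

def parse_judge_verdict(raw: str) -> dict:
    """Parse a judge's JSON verdict; reject malformed or incomplete ones."""
    verdict = json.loads(raw)
    for dim in REQUIRED_DIMENSIONS:
        score = verdict.get(dim)
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {dim!r}: {score!r}")
    if not verdict.get("reasoning"):
        raise ValueError("verdict missing reasoning")
    return verdict
```

Rejecting malformed verdicts outright, rather than coercing them, keeps a drifting judge from silently corrupting your scores.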

Track judge agreement with humans. Periodically sample 20-30 cases and have a human rate them. Measure how often the LLM judge agrees. If agreement drops, your judge is drifting — possibly due to a judge model update. This is not a one-time calibration; it is an ongoing monitoring task.
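
The agreement measurement itself is a few lines. A sketch, assuming you have reduced both the judge's and the human's ratings to pass/fail labels on the same sampled cases:

```python
def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of sampled cases where the judge and the human reviewer agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be paired per case")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Log this number with each calibration round; a sustained drop below your historical baseline is the drift signal.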

LLM-as-judge tests are slow and expensive. Do not run them on every commit. Run them on a sampled set, nightly or pre-release.

4. Regression Baselines

A regression baseline is a curated set of (input, expected behavior) pairs that you commit to your repository and protect. Every model update, every prompt change, every dependency bump must pass the baseline before deployment.

The critical word is “curated.” A good baseline is not a random sample of past conversations. It is a deliberate collection of:

  - The agent's core use cases, which must always work.
  - Every failure discovered in production, encoded as a test case.
  - Known ambiguous, adversarial, and edge-case inputs that previously caused trouble.

Baselines need to evolve. When you update your model or prompt and the correct behavior changes, update the baseline. When you discover a new failure class, add it. A baseline that is six months stale is a false security blanket.

The mechanism for evaluating baseline cases depends on the type. Format checks and constraint checks can use deterministic assertions. Quality checks need LLM-as-judge. Structure your baseline entries to include both the input and the evaluation method for each expected behavior.
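
One way to structure that, with the field names and the judge interface as assumptions:

```python
# A baseline entry that carries its own evaluation method alongside the input.
BASELINE_ENTRY = {
    "id": "refund-partial-shipment-001",
    "input": "I only received 2 of the 5 items in my order. Can I get a refund?",
    "eval_method": "deterministic",  # or "judge"
    "expected": {"escalated": True},
}

def evaluate_entry(entry: dict, result, judge=None) -> bool:
    """Dispatch a baseline case to the evaluation method it declares."""
    if entry["eval_method"] == "deterministic":
        return all(getattr(result, key, None) == value
                   for key, value in entry["expected"].items())
    if entry["eval_method"] == "judge":
        return judge.evaluate(entry["input"], result.content,
                              rubric=entry["rubric"]).passed
    raise ValueError(f"unknown eval method: {entry['eval_method']}")
```

Keeping the evaluation method in the entry itself means the baseline runner stays generic as the corpus grows.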


Implementation Patterns

Structuring a pytest-style Eval Suite

The following pattern separates fast, cheap evals from slow, expensive ones:

import json
import re
from random import sample

import pytest

# validate_schema, embedding_similarity, extract_citations, ALLOWED_TOOLS,
# TOOL_SCHEMAS, PROPERTY_CASES, and ACCURACY_THRESHOLD come from your
# project's eval utilities.

# Fixtures load test cases from your corpus
@pytest.fixture(scope="session")
def eval_corpus():
    with open("evals/corpus.json") as f:
        return json.load(f)

# --- TIER 1: Deterministic checks (run on every commit) ---

class TestDeterministic:
    def test_output_is_valid_json(self, agent, eval_corpus):
        for case in eval_corpus["format_cases"]:
            result = agent.run(case["input"])
            # Should never raise; if it does, the agent is broken
            parsed = json.loads(result.content)
            assert parsed is not None

    def test_no_pii_leakage(self, agent, eval_corpus):
        pii_patterns = [r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"]  # credit cards
        for case in eval_corpus["pii_cases"]:
            result = agent.run(case["input"])
            for pattern in pii_patterns:
                assert not re.search(pattern, result.content), \
                    f"PII detected in output for case {case['id']}"

    def test_tool_calls_use_valid_schema(self, agent, eval_corpus):
        for case in eval_corpus["tool_cases"]:
            trace = agent.run_with_trace(case["input"])
            for tool_call in trace.tool_calls:
                assert tool_call.name in ALLOWED_TOOLS, \
                    f"Hallucinated tool: {tool_call.name}"
                validate_schema(tool_call.arguments, TOOL_SCHEMAS[tool_call.name])

# --- TIER 2: Property checks (run on every commit, sampled) ---

class TestProperties:
    @pytest.mark.parametrize("case", sample(PROPERTY_CASES, 50))
    def test_agent_attempts_task(self, agent, case):
        result = agent.run(case["input"])
        # Heuristic: response must be non-empty and on-topic
        assert len(result.content) > 20
        assert embedding_similarity(result.content, case["input"]) > 0.3

    def test_no_hallucinated_sources(self, agent, eval_corpus):
        for case in eval_corpus["rag_cases"]:
            trace = agent.run_with_trace(case["input"])
            cited_sources = extract_citations(trace.final_output)
            retrieved_sources = {doc.id for doc in trace.retrieved_docs}
            assert cited_sources.issubset(retrieved_sources), \
                "Agent cited a source it did not retrieve"

# --- TIER 3: LLM-as-judge (run nightly, pre-release) ---

@pytest.mark.slow
class TestQuality:
    def test_accuracy_on_baseline(self, agent, judge_llm, baseline_corpus):
        scores = []
        for case in baseline_corpus:
            result = agent.run(case["input"])
            score = judge_llm.evaluate(
                input=case["input"],
                output=result.content,
                rubric=case["rubric"]
            )
            scores.append(score.accuracy)

        mean_score = sum(scores) / len(scores)
        assert mean_score >= ACCURACY_THRESHOLD, \
            f"Mean accuracy {mean_score:.2f} below threshold {ACCURACY_THRESHOLD}"

Handling Non-Determinism

For tests that depend on probabilistic behavior, run multiple samples and use statistical thresholds:

def test_refund_escalation_rate(agent):
    """Agent should escalate >95% of partial-shipment refund requests."""
    cases = load_cases("partial_shipment_refunds")
    escalated = 0

    for case in cases:
        # Run each case 3 times; count as escalated if any run escalates
        for _ in range(3):
            result = agent.run(case["input"])
            if result.escalated:
                escalated += 1
                break

    rate = escalated / len(cases)
    assert rate >= 0.95, f"Escalation rate {rate:.2%} below 95% threshold"

The key decisions are: how many runs per case, and what threshold counts as pass. These should be calibrated against your observed variance, not chosen arbitrarily.
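
One defensible way to set the threshold is to gate on a confidence bound rather than the raw observed rate. A sketch using the lower bound of the Wilson score interval:

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for an observed pass rate."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = p + z ** 2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2))
    return (center - margin) / denom
```

Asserting `wilson_lower_bound(escalated, total) >= 0.95` instead of comparing the raw rate makes the test honest about sample size: 96 escalations out of 100 runs is still consistent with a true rate well below 95%, and the bound says so.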

Fast/Slow Split and CI Integration

Wire the tiers into your CI pipeline with separate stages:

# .github/workflows/agent-evals.yml
on:
  push:
  pull_request:
  schedule:
    - cron: "0 3 * * *"  # nightly

jobs:
  deterministic-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest evals/ -m "not slow" --timeout=60
    # Runs in < 2 minutes, blocks merge

  quality-evals:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule' || contains(github.event.head_commit.message, '[eval]')
    steps:
      - uses: actions/checkout@v4
      - run: pytest evals/ -m slow --timeout=600
    # Runs nightly and on-demand; alerts on failure but doesn't block merge

The fast tier should complete under 2 minutes and block every pull request. The slow tier runs nightly and before any production release. Never put LLM-as-judge in the fast tier — a single slow LLM call in CI will cause developers to skip or disable the evals.


The Minimum Viable Eval Suite

Before any agent reaches production, it should have at minimum these ten tests:

  1. Format contract test. Every output conforms to the expected schema (JSON, specific fields, required keys).
  2. Tool call allowlist. The agent never calls a tool not in the approved set.
  3. Constraint compliance. The agent respects its defined operating limits (amount thresholds, scope boundaries, rate limits).
  4. No PII echo. Sensitive data passed in context does not appear verbatim in output.
  5. Task attempt rate. On a representative sample of valid inputs, the agent meaningfully attempts the task on at least 95% of runs.
  6. Escalation on ambiguity. When given deliberately ambiguous inputs, the agent asks for clarification or escalates rather than silently guessing.
  7. Adversarial robustness. A set of jailbreak and prompt injection attempts always produces a refusal or safe default behavior — never compliance.
  8. Accuracy on core use cases. LLM-as-judge score on the agent’s primary task type clears a minimum threshold, measured against a curated baseline.
  9. Regression on known failure cases. Every production incident has a corresponding test case. None of them regress.
  10. Latency and cost bounds. The agent completes tasks within acceptable time and token limits. Runaway loops are caught before they hit users.

This is not a comprehensive suite. It is the floor. An agent with these ten tests has basic coverage of format, safety, behavior, and regressions. An agent without them is flying blind.


The Hard Truth About Agent Evals

Most agent eval suites are a form of wishful thinking. They test the happy path against prompts similar to the ones the agent was developed on, run a single sample per case, and check outputs with assertions that would pass on a response the agent has essentially memorized. They give you a green bar. They do not give you confidence.

Real evals test what happens when things go wrong: when the input is slightly malformed, when the user tries to push the agent outside its scope, when the tool returns an unexpected error, when the model update shifts behavior in a subtle but consequential way. AgentBench demonstrated this concretely — even strong commercial models that perform well on standard benchmarks fail systematically in multi-step agentic environments that require long-horizon reasoning under uncertainty (arXiv:2308.03688). GAIA, designed around real-world assistant tasks, found a 77-point gap between human performance (92%) and top AI systems (15%) on questions that should theoretically be within reach of capable agents — the failures are not in exotic edge cases but in compound, realistic tasks (arXiv:2311.12983).

The lesson from both benchmarks is the same: evaluating agents on simplified, controlled scenarios produces results that do not transfer to production. Your eval suite must represent the actual distribution of inputs your agent will face — including the partial shipment refund requests, the ambiguous edge cases, the adversarial users, and the compound failures that only emerge when multiple things go slightly wrong at once.

Write tests for the failures you have already seen. Write tests for the failures you are afraid of. Run them against multiple samples, with statistical thresholds you can defend. Maintain a living baseline that grows every time production surprises you.

A 98% pass rate on a test suite you built against your own assumptions is not safety. It is a number you are telling yourself. The question is whether your eval suite is capable of surprising you — of catching something you did not already know was wrong.

If it cannot, you do not have an eval suite. You have a ritual.


References

Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
Liu, X., et al. (2023). AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688.
Mialon, G., et al. (2023). GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983.