
Eval-Driven Development for LLM Systems

Why evals replace unit tests for LLM systems: error-analysis-first workflow, golden sets, the test pyramid, and CI-gate harnesses in Python and TS.

Jatin Bansal

A team ships a small LLM feature behind a flag. Two engineers eyeball forty outputs, the answers look right, the flag rolls to 100%. Three weeks later finance flags a 4x cost spike, support flags a wave of complaints about wrong product names, and a competitor flags a public hallucination in a screenshot. The post-mortem finds three root causes — a model upgrade silently changed tool-calling behavior, a prompt edit broke output formatting on a long-tail query class, and a retriever change shifted the context distribution. None of these would have triggered a single unit test, because none of them is a unit-test-shaped failure. They are statistical drifts across an input distribution the team never enumerated. This is the failure mode every production LLM system hits eventually, and the only known antidote is an eval suite the team trusts more than the demo.

Opening bridge

Yesterday’s piece on production memory frameworks closed the memory subtree on a number — Mem0 at 94.4% LongMemEval, Zep at 71.2%, the contradiction-resolution category below 6% across the board. Every framework comparison rested on the same load-bearing artifact: a fixed eval. The memory-evaluation article worked the specific shape that artifact takes for memory systems; the RAG-evaluation article worked the shape it takes for retrieval pipelines. Today’s piece pulls back: what is the eval-driven workflow itself, regardless of whether the system under test is a RAG pipeline, an agent, a memory layer, or a plain prompt? This is the opening of the Evaluation subtree and the layer underneath every other subtree’s “measure it” sections.

Definition

Eval-driven development is the discipline of treating a versioned eval suite as the contract for an LLM application — every change to the system is judged by what it does to the suite, and the suite is the artifact that gates promotion. Three properties separate it from “we have some tests.” First, the suite is derived from observed failures, not from a priori specification — the input distribution and failure modes of an LLM system are too wide to enumerate up front, so the suite grows out of error analysis on real traces. Second, the suite measures cost, latency, and quality together — a change that improves quality by two points and triples per-call cost is a regression, and the eval makes that visible in the same dashboard. Third, the suite is a CI gate, not a research artifact — the cheap deterministic slice runs on every PR, the expensive LLM-judged slice runs nightly or pre-release, and the merge boundary is wired to a quality threshold.

A note on the name. “Eval-driven development” gestures at TDD, and that gesture is partly misleading. Hamel Husain and Shreya Shankar argue strongly that pure write-evals-first does not work for LLM systems, because the surface area is unbounded and you cannot enumerate failure modes from a spec sheet. The right loop is error analysis first, evals second — look at traces, categorize what is actually breaking, then write an eval that catches the category. The TDD shape (write the failing test, fix the bug, commit both together) survives; the “imagine all the failures” shape does not. Call it eval-driven development as long as you remember which half of the analogy holds.

Intuition

The mental model that pays off: an LLM application is a system whose input space is too large to enumerate, so your job is to enumerate the failure modes instead, and the eval suite is the materialized list of failure modes you have learned to expect. Each entry in the suite is a frozen fingerprint of a category of bug — a query type, a context pattern, an edge case in formatting — paired with the assertion that catches it. The suite starts small (the half-dozen failures you saw in week one), grows monotonically (every new failure mode found in production gets a row), and never shrinks unless the underlying capability is deliberately deprecated.

This is the inversion of unit testing. In a deterministic system, you write a test before the code, because the failure modes are known: the function takes an integer, it can be negative, zero, positive, or out of range. In an LLM system, the function takes English, and the failure modes are whatever the input distribution turns out to surface — typos that derail tokenization, idioms the model misreads, instruction-injection patterns hidden in user data, formatting quirks that break downstream parsing. You don’t write tests against an enumeration; you mine the enumeration out of the system in production.

Three signals separate a real eval suite from a vanity dashboard. First, every row has provenance — the trace that generated the failure, the date it was first observed, the engineer who added it. Second, the metric responds when the system regresses — if a known-broken behavior fails to drag the score down, the metric is averaging the bug into the noise floor and the suite needs sharper category-level scoring. Third, the threshold is non-trivially tight — a suite at 99% with rows that all pass is a suite that doesn’t actually gate anything; you need rows you barely pass and rows you sometimes fail so the threshold has signal.

The distributed-systems parallel

The cleanest analogue is the test pyramid plus production observability. The classical pyramid — many fast unit tests at the base, fewer integration tests in the middle, a handful of expensive end-to-end tests at the top — translates almost directly. The base of an LLM eval pyramid is deterministic assertions: regexes, JSON-schema validation, string match against expected substrings, length bounds, latency caps. These are pennies-per-run, deterministic, and run on every commit. The middle is code-based functional checks: did the agent call the right tool, did the retriever return the gold doc in the top-K, did the response stay under the token budget. Still cheap and deterministic, but inspecting semantics rather than syntax. The top is LLM-judged checks: faithfulness against retrieved context, answer relevance, helpfulness against a rubric. These are dollars-per-run, statistical, and run nightly or on a sampled subset. (Hamel Husain wrote up the discipline for the top layer specifically.)

The same pyramid underwrites a CI strategy borrowed from microservice deploys: the cheap deterministic layer is your unit-CI gate, blocking merge on regression; the expensive judged layer is your integration-CI nightly, alerting on trend movement. The judged layer’s noise floor is wide enough that single-run movement is rarely actionable — what you watch is the seven-day moving average and the per-category breakdown. The cheap layer’s noise floor is zero, so a one-point movement is real and merge-blocking.
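A minimal sketch of the trend-watching idea for the judged layer, assuming nightly judge means arrive as a list of floats, oldest first. The function names, the seven-run window, and the 0.03 drop tolerance are illustrative assumptions, not values taken from any tool.

```python
from statistics import mean

def moving_average(scores: list[float], window: int = 7) -> list[float]:
    """Trailing mean over the last `window` runs (shorter at the start)."""
    return [mean(scores[max(0, i - window + 1): i + 1]) for i in range(len(scores))]

def trend_alert(scores: list[float], window: int = 7, max_drop: float = 0.03) -> bool:
    """True when the latest moving average sits more than `max_drop`
    below the moving average from one full window earlier."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare two full windows
    ma = moving_average(scores, window)
    return ma[-1 - window] - ma[-1] > max_drop

# Hypothetical nightly judge means: a noisy plateau, then a sustained dip.
stable = [0.84, 0.86, 0.85, 0.83, 0.87, 0.85, 0.84] * 2
dipped = stable[:7] + [0.80, 0.79, 0.81, 0.78, 0.80, 0.79, 0.80]

print(trend_alert(stable))  # single-run wobble inside the noise floor
print(trend_alert(dipped))  # a week-long shift in the average
```

Comparing two full windows rather than two single runs is the point: a one-night swing moves the average by a fraction of its size, while a durable shift shows up in full.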

The deeper parallel is closer to chaos engineering for the input distribution. Chaos engineering hypothesizes failures, injects them, and asserts the system survives. LLM evals hypothesize input distributions, replay them, and assert quality survives. Both disciplines accept that the system cannot be fully characterized by inspection and has to be probed empirically. Both produce libraries of injection scenarios (in chaos: pod kill, network partition, CPU stress; in evals: typo class, idiom class, jailbreak class) that grow with operational experience. Both are most valuable in the categories nobody on the team would have predicted.

There’s a real disanalogy worth flagging. In a deterministic system, a passing test is a proof. In an LLM eval, a passing score is a measurement: judge sampling noise, retriever ordering instability, or a tokenizer quirk can each move a 0.85 to a 0.88 or back. The eval pyramid’s cheap layer recovers deterministic-test semantics by sticking to assertions that don’t depend on the judge. The middle and top layers admit the statistical regime; the discipline is to set thresholds with confidence intervals, not point estimates, and to gate on durable trends, not single-run swings.

Mechanics: the error-analysis-first loop

The loop that produces a useful eval suite is concrete enough to write down. Five steps, repeated weekly during early development, monthly once the system stabilizes:

1. Capture traces. Log every model call with its inputs, retrieved context, output, latency, token counts, and cost. Sample randomly or stratify by query type. The production tracing article is the dedicated piece on this layer (span shape, OTel GenAI semantic conventions, sampling, and the build-vs-buy decision across the platform landscape); for now, just ensure traces are queryable.
2. Open-coded review. A single domain expert (not three) reads 50–100 traces back-to-back and writes free-text notes about every problem they see. No category schema yet. This is the “journaling” pass Hamel and Shreya borrow from qualitative research; the goal is to surface failure modes you didn’t know existed.
3. Axial coding. Cluster the open-coded notes into categories: “wrong product name,” “ignored the prior turn,” “output JSON missed a required field,” “took the long path through the tool call graph.” Most categories will be small; a few will dominate the volume. The categories are now your taxonomy.
4. Write the eval row. For each category, pick the cheapest assertion that catches it. Schema validation for the JSON case. Substring match for the product-name case. Tool-call-sequence assertion for the wrong-path case. LLM-as-judge with a specific rubric for the “ignored prior turn” case where no cheaper assertion exists. Add a row to the suite for each.
5. Wire to CI and watch the trend. The deterministic rows gate the next merge. The judged rows trend on the dashboard. When a new failure shows up in production, return to step 1 with that trace.
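The axial-coding step above ends with a volume ranking, and that ranking can be sketched in a few lines. This assumes the reviewer's notes have already been reduced to (trace id, category) pairs; the trace ids and the `rank_categories` helper are hypothetical, and the category names are the ones used in the list above.

```python
from collections import Counter

# Hypothetical reviewer output after axial coding: (trace_id, category) pairs.
coded_notes = [
    ("t-001", "wrong product name"),
    ("t-002", "wrong product name"),
    ("t-003", "output JSON missed a required field"),
    ("t-004", "ignored the prior turn"),
    ("t-005", "wrong product name"),
    ("t-006", "output JSON missed a required field"),
]

def rank_categories(notes):
    """Return (category, count, share) tuples sorted by volume, largest first."""
    counts = Counter(category for _, category in notes)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]

ranking = rank_categories(coded_notes)
for cat, n, share in ranking:
    print(f"{cat:40s} {n:3d}  {share:.0%}")
```

The top of the ranking tells you which eval rows to write first; the long tail of single-trace categories can wait until they recur.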

The trap that kills internal evals is going straight from “we want quality” to “let’s write a faithfulness judge” without the open-coded pass. You end up with a metric that measures a generic property nobody on the team can characterize, scores in the 0.7–0.85 band on every change, and has no relationship to the bugs your users actually report. Error analysis grounds the eval in the system’s actual failure distribution.

Mechanics: golden sets versus live evals

The eval inputs come from two places, and both are necessary. The golden set is a frozen, curated list of inputs — typically 50–500 — checked into version control, never modified except to add new rows or deprecate old ones. It’s the regression unit. Every change to the system is compared against the suite’s previous score on exactly the same inputs. Without this, you cannot tell whether a metric movement is system change or input change. The golden set is small enough to be hand-curated and stable enough to be a contract.

Live evals sample real production traffic, redact PII, and score the sample asynchronously. They catch distribution drift — the kinds of queries users started sending after Wednesday’s marketing push — that the golden set, frozen six months ago, doesn’t represent. The dashboard shows both: golden score (changes only when the system changes) and live score (changes when the system or the input distribution changes). The gap between them is the distribution drift signal.
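As a rough sketch of the gap signal, assuming both the golden run and the live sample produce per-row scores on the same 0 to 1 scale. The `drift_report` name and the 0.05 tolerance are assumptions for illustration; pick a tolerance from your own score history.

```python
from statistics import mean

def drift_report(golden_scores: list[float], live_scores: list[float],
                 tolerance: float = 0.05) -> dict:
    """Compare the frozen golden-set score with the sampled live score.

    A positive gap means live traffic scores below the golden set, which
    suggests the input distribution has moved away from what was curated.
    """
    golden = mean(golden_scores)
    live = mean(live_scores)
    gap = golden - live
    return {"golden": golden, "live": live, "gap": gap, "drifting": gap > tolerance}

# Hypothetical scores: same system version, different input sources.
report = drift_report(
    golden_scores=[0.90, 0.85, 0.95, 0.90],
    live_scores=[0.80, 0.75, 0.85, 0.80],
)
print(report)
```

When `drifting` flips to true with no system change in between, the next error-analysis pass should start from the live traces, not the golden set.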

A practical rule for golden-set construction: every row should map to a specific category from your error analysis. A row that exists “for coverage” of some imagined edge case is a row you can’t interpret when it regresses. The NurtureBoss case study in Hamel’s evals piece is the canonical worked example — they grew their golden set by mining the cases their date-handling logic had previously failed on, not by brainstorming dates a user might type.
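One way to enforce that rule mechanically is a small lint over the golden-set file, sketched here under the assumption that the set is stored as JSON Lines. The field names (`source_trace_id`, `first_observed`, `added_by`) and the taxonomy contents are hypothetical; substitute whatever your error analysis produced.

```python
import json

# Hypothetical taxonomy produced by the axial-coding pass.
TAXONOMY = {
    "wrong product name",
    "ignored the prior turn",
    "output JSON missed a required field",
}
# Hypothetical provenance fields every row is expected to carry.
REQUIRED = ("id", "category", "source_trace_id", "first_observed", "added_by")

def validate_golden_rows(lines: list[str], taxonomy: set[str] = TAXONOMY) -> list[str]:
    """Return one problem string per violation; an empty list means the set is clean."""
    problems = []
    for n, line in enumerate(lines, start=1):
        row = json.loads(line)
        missing = [key for key in REQUIRED if not row.get(key)]
        if missing:
            problems.append(f"line {n}: missing {missing}")
        if row.get("category") not in taxonomy:
            problems.append(f"line {n}: category {row.get('category')!r} not in taxonomy")
    return problems

golden_lines = [
    json.dumps({"id": "pn-01", "category": "wrong product name",
                "source_trace_id": "t-001", "first_observed": "2025-01-10",
                "added_by": "reviewer-a"}),
    json.dumps({"id": "cov-07", "category": "imagined edge case"}),
]
problems = validate_golden_rows(golden_lines)
for p in problems:
    print(p)
```

Running this as a pre-commit or CI step keeps "for coverage" rows from creeping in unnoticed.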

Mechanics: cost and latency as first-class metrics

A common failure mode is to track quality alone, leaving cost and latency in a separate dashboard nobody opens. The day you ship a model upgrade that improves faithfulness by three points and triples per-call cost, the quality-only dashboard cheers and the finance team panics. Treat each eval row as emitting three numbers — quality score, p50 token cost, p95 latency — and require the suite’s release-gating threshold to include all three. The right framing is the Pareto frontier: a change that moves quality up and cost down is unambiguous; a change that moves only one and worsens the other is a judgment call that has to happen on the PR, not after deploy. This is the same idea the prompt-caching article develops for the inference layer and the RAG-evaluation article develops for retrieval pipelines.
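The Pareto framing can be made concrete with a small comparison function. This is a sketch under the assumption that a suite run collapses to three numbers; the `SuiteMetrics` type, the `classify_change` name, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuiteMetrics:
    quality: float         # higher is better
    cost_usd: float        # lower is better
    p95_latency_ms: float  # lower is better

def classify_change(before: SuiteMetrics, after: SuiteMetrics) -> str:
    """Label a change by where it lands relative to the previous run."""
    better = [
        after.quality > before.quality,
        after.cost_usd < before.cost_usd,
        after.p95_latency_ms < before.p95_latency_ms,
    ]
    worse = [
        after.quality < before.quality,
        after.cost_usd > before.cost_usd,
        after.p95_latency_ms > before.p95_latency_ms,
    ]
    if any(better) and not any(worse):
        return "improvement"    # dominates the old point
    if any(worse) and not any(better):
        return "regression"     # dominated by the old point
    if any(better) and any(worse):
        return "judgment call"  # a trade-off someone must approve on the PR
    return "no change"

baseline = SuiteMetrics(quality=0.82, cost_usd=0.004, p95_latency_ms=2100)
upgrade = SuiteMetrics(quality=0.85, cost_usd=0.012, p95_latency_ms=2300)
print(classify_change(baseline, upgrade))
```

In practice you would compare with a tolerance rather than strict inequality, since judged quality scores carry run-to-run noise.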

Code: a minimal Python eval harness

The simplest working harness is roughly a hundred lines of Python that ingests a golden set, runs the system under test, evaluates each row with a mix of deterministic and judged checks, and returns per-row results plus a summary that a CI job can store against the commit SHA. Install: pip install anthropic pydantic (and your test runner; pytest works fine).

python

# pip install anthropic pydantic
import json
import time
from dataclasses import dataclass
from typing import Callable
from anthropic import Anthropic
from pydantic import BaseModel

client = Anthropic()

@dataclass
class EvalRow:
    id: str
    category: str           # from error analysis
    input: dict             # whatever shape the system takes
    must_contain: list[str] | None = None      # deterministic substring check
    must_not_contain: list[str] | None = None  # substrings that must be absent
    schema: type[BaseModel] | None = None  # deterministic schema check
    judge_rubric: str | None = None        # LLM-judged check
    max_latency_ms: int | None = None
    max_cost_usd: float | None = None

@dataclass
class EvalResult:
    row_id: str
    category: str
    passed_deterministic: bool
    judge_score: float | None
    latency_ms: int
    cost_usd: float
    error: str | None = None

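def llm_judge(output: str, rubric: str) -> float | None:
    """Score in (0, 1] from a small judge model; None if the reply can't be parsed."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Output:\n{output}\n\n"
        "Score this 1-5 against the rubric. Reply with only the integer."
    )
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=4,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        score = int(msg.content[0].text.strip())
    except (ValueError, IndexError, AttributeError):
        # Unparseable reply: return None so the `is not None` filter in
        # run_suite drops it instead of letting NaN poison the judge mean.
        return None
    # Clamp to the rubric's 1-5 range before normalising.
    return min(max(score, 1), 5) / 5.0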

def run_row(row: EvalRow, system_under_test: Callable) -> EvalResult:
    t0 = time.monotonic()
    try:
        out = system_under_test(row.input)
    except Exception as e:
        return EvalResult(row.id, row.category, False, None, 0, 0.0, str(e))
    latency_ms = int((time.monotonic() - t0) * 1000)

    # Deterministic checks
    passed = True
    if row.must_contain:
        passed &= all(s in out["text"] for s in row.must_contain)
    if row.must_not_contain:
        passed &= all(s not in out["text"] for s in row.must_not_contain)
    if row.schema:
        try:
            row.schema.model_validate_json(out["text"])
        except Exception:
            passed = False
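    if row.max_latency_ms is not None:
        passed &= latency_ms <= row.max_latency_ms
    # max_cost_usd is declared on EvalRow, so enforce it alongside latency.
    if row.max_cost_usd is not None:
        passed &= out.get("cost_usd", 0.0) <= row.max_cost_usd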

    # Judged check (only if deterministic passed; cheap-first ladder)
    judge_score = None
    if passed and row.judge_rubric:
        judge_score = llm_judge(out["text"], row.judge_rubric)

    return EvalResult(
        row_id=row.id,
        category=row.category,
        passed_deterministic=passed,
        judge_score=judge_score,
        latency_ms=latency_ms,
        cost_usd=out.get("cost_usd", 0.0),
    )

def run_suite(rows: list[EvalRow], system_under_test: Callable) -> dict:
    results = [run_row(r, system_under_test) for r in rows]
    by_cat = {}
    for r in results:
        by_cat.setdefault(r.category, []).append(r)
    return {
        "pass_rate": sum(r.passed_deterministic for r in results) / len(results),
        "judge_mean": (
            sum(r.judge_score for r in results if r.judge_score is not None)
            / max(1, sum(1 for r in results if r.judge_score is not None))
        ),
        "p95_latency_ms": sorted(r.latency_ms for r in results)[int(len(results) * 0.95)],
        "total_cost_usd": sum(r.cost_usd for r in results),
        "by_category": {
            cat: {
                "pass_rate": sum(r.passed_deterministic for r in rs) / len(rs),
                "n": len(rs),
            }
            for cat, rs in by_cat.items()
        },
        "results": results,
    }

The harness is deliberately minimal so the shape stays visible. Three things to flag. First, the cheap-first ladder — judged checks only run when deterministic checks pass; a row that fails schema validation doesn’t pay for a judge call. Second, per-category breakdown — the aggregate is for the dashboard, the per-category numbers are for action. Third, the cost and latency rollups are first-class outputs, not afterthoughts. A CI script wraps this with a pass/fail decision: result["pass_rate"] >= 0.95 and result["judge_mean"] >= 0.80 and result["p95_latency_ms"] <= 3000.
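A sketch of that CI wrapper, assuming the summary dict has the keys the harness returns. The `gate_failures` helper and the `THRESHOLDS` name are hypothetical; the three threshold values are the ones quoted in the gate expression.

```python
# Thresholds taken from the gate expression in the text; tune them per project.
THRESHOLDS = {"pass_rate": 0.95, "judge_mean": 0.80, "p95_latency_ms": 3000}

def gate_failures(summary: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return one readable message per violated threshold (empty list = gate passes)."""
    failures = []
    if summary["pass_rate"] < thresholds["pass_rate"]:
        failures.append(
            f"pass_rate {summary['pass_rate']:.3f} < {thresholds['pass_rate']}"
        )
    if summary["judge_mean"] < thresholds["judge_mean"]:
        failures.append(
            f"judge_mean {summary['judge_mean']:.3f} < {thresholds['judge_mean']}"
        )
    if summary["p95_latency_ms"] > thresholds["p95_latency_ms"]:
        failures.append(
            f"p95_latency_ms {summary['p95_latency_ms']} > {thresholds['p95_latency_ms']}"
        )
    return failures

# Hypothetical summary: quality slipped on the judged slice only.
example = {"pass_rate": 0.96, "judge_mean": 0.78, "p95_latency_ms": 2400}
messages = gate_failures(example)
exit_code = 1 if messages else 0
for m in messages:
    print("GATE FAIL:", m)
```

In a real CI step the summary would come from `run_suite(ROWS, system_under_test)` and the script would end with `sys.exit(exit_code)`. Reporting which threshold tripped, rather than a bare boolean, saves a round trip to the logs.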

A worked row that exercises every assertion type:

python

class OrderResponse(BaseModel):
    order_id: str
    status: str
    items: list[dict]

ROWS = [
    EvalRow(
        id="order_status_basic",
        category="order_status_intent",
        input={"prompt": "What's the status of order 12345?"},
        must_contain=["12345"],
        must_not_contain=["I don't know", "as an AI"],
        schema=OrderResponse,
        judge_rubric=(
            "The response answers the user's order status question, "
            "stays factual to the tool output, and is under 2 sentences."
        ),
        max_latency_ms=4000,
    ),
    # ...49 more rows, one per category × variant from error analysis
]

Code: a TypeScript harness with Promptfoo

On the TypeScript side, Promptfoo has become the dominant CI-gating tool: declarative YAML configs, dozens of provider integrations, built-in assertion types, and a CLI you can run as npx promptfoo eval in a CI step. Install: npm install -g promptfoo or as a dev dependency. The config below scores three rows against a Claude-backed assistant, asserts on substring, schema, latency, and a judged rubric, and emits a JSON summary the CI script can gate on.

yaml

# promptfooconfig.yaml
prompts:
  - |
    You are a support assistant. Given the user message, decide whether
    to call get_order_status(order_id) or reply directly. Respond as JSON
    matching {action: "call_tool"|"reply", order_id?: string, text?: string}.

    User message: {{user_input}}

providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 256

tests:
  - description: "order_status_intent: extracts order id and routes to tool"
    vars:
      user_input: "What's the status of order 12345?"
    assert:
      - type: contains
        value: "12345"
      - type: is-json
        value:
          required: ["action"]
          properties:
            action: { enum: ["call_tool", "reply"] }
      - type: javascript
        value: |
          // Multi-line javascript assertions need an explicit return.
          const parsed = JSON.parse(output);
          return parsed.action === "call_tool" && parsed.order_id === "12345";
      - type: latency
        threshold: 4000
      - type: llm-rubric
        value: "The response routes to the order-status tool with the right id."

  - description: "out_of_scope: refuses non-support questions politely"
    vars:
      user_input: "Write me a poem about ducks."
    assert:
      - type: not-contains
        value: "duck"
      - type: llm-rubric
        value: "Refuses the request and points back to the support domain."

  - description: "ambiguous: asks for missing order id"
    vars:
      user_input: "Where is my order?"
    assert:
      - type: contains-any
        value: ["order number", "order id", "could you share"]
      - type: latency
        threshold: 3000

outputPath: ./eval-results.json

typescript

// scripts/run-evals.ts (CI-gating wrapper)
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

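try {
  execSync("npx promptfoo eval -c promptfooconfig.yaml", { stdio: "inherit" });
} catch {
  // promptfoo exits non-zero when any test fails, which makes execSync throw.
  // Swallow that here so the pass-rate threshold below makes the decision;
  // a missing results file will still fail the script at readFileSync.
}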

interface PromptfooResult {
  results: {
    stats: {
      successes: number;
      failures: number;
      tokenUsage: { total: number };
    };
  };
}

const results: PromptfooResult = JSON.parse(
  readFileSync("./eval-results.json", "utf8"),
);
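const { successes, failures, tokenUsage } = results.results.stats;
const total = successes + failures;
// Treat an empty run as a failure; a NaN pass rate would slip past the gate.
const passRate = total === 0 ? 0 : successes / total;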

if (passRate < 0.95) {
  console.error(`Eval gate failed: ${(passRate * 100).toFixed(1)}% pass rate`);
  process.exit(1);
}
console.log(`Eval gate passed: ${(passRate * 100).toFixed(1)}% pass rate, ` +
            `${tokenUsage.total} tokens consumed`);

The shape mirrors the Python harness: cheap deterministic assertions (contains, is-json, javascript, latency) are listed ahead of the llm-rubric assertion. Unlike the Python harness, don’t assume the rubric call is skipped when a cheap assertion fails; check how your Promptfoo version schedules assertions before counting on that saving. Promptfoo’s matrix-of-providers feature lets the same suite run against Claude, GPT, and Gemini side-by-side, which is how teams answer “should we upgrade the model” without taking a guess. See also OpenAI’s Evals API and the open-source openai/evals framework, Anthropic’s Console evaluation tool, and the managed platforms Braintrust and LangSmith for variants of the same harness pattern.

Trade-offs, failure modes, gotchas

The “we’ll write evals later” trap. Every team that ships an LLM feature without an eval suite says they will write one once they have time. They never do. The eval suite is what gives you time — without it, every change is a guess and every incident is a fresh investigation. Write the first ten rows the day you ship the first feature; the cost is hours, not weeks.

Imagined failure modes versus observed failure modes. The single biggest source of wasted eval effort is rows written from imagination rather than from traces. They cover behaviors users don’t exercise and miss behaviors users actually trigger. Audit your suite quarterly: which rows have never failed across 50+ runs? Either they’re trivially easy and don’t gate anything, or they’re testing a non-issue. Either way, retire them and reinvest the budget in rows derived from production traces.
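The quarterly audit is easy to automate if per-row outcomes are kept per run. A minimal sketch, assuming a history mapping from row id to a list of booleans, one per run; the `never_failed` name and the sample ids are hypothetical, and the 50-run cutoff is the one from the paragraph above.

```python
def never_failed(history: dict[str, list[bool]], min_runs: int = 50) -> list[str]:
    """Row ids that have at least `min_runs` recorded runs and passed every one."""
    return sorted(
        row_id
        for row_id, runs in history.items()
        if len(runs) >= min_runs and all(runs)
    )

# Hypothetical run history: True = row passed on that run.
history = {
    "order_status_basic": [True] * 60,            # always green: audit candidate
    "date_parsing_relative": [True] * 55 + [False] * 5,
    "new_row_last_week": [True] * 7,              # too young to judge
}
candidates = never_failed(history)
print(candidates)
```

Treat the output as a review list rather than an automatic delete list: a row can be always-green because the bug it guards against was fixed properly, and that is still worth knowing.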

LLM-judge calibration drift. A faithfulness score from Claude Sonnet 4.6 is not the same scale as a faithfulness score from GPT-5.5. Pin the judge model in your eval config, treat a judge upgrade as a metric reset, and dual-run for at least a week when transitioning. The RAG-evaluation article goes deep on the judge problem; the same warnings apply at the application layer.

Position bias in pairwise judges. If you use pairwise rather than pointwise judging (“which is better, A or B?”), the judge often prefers the first option shown by 5+ points. Mitigate by running each pair twice with positions swapped and averaging. The LLM-as-judge article is the deep dive on this bias and the others (verbosity, self-preference, length) that show up in every untreated judge pipeline.
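The swap-and-average mitigation looks roughly like this. The judge here is a plain callable returning "first" or "second"; that interface and both stub judges are assumptions for illustration, standing in for a real model call.

```python
from typing import Callable

# Assumed judge interface: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]

def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> float:
    """Preference for `a` in [0, 1], averaged over both presentation orders.

    1.0 = `a` wins in both orders, 0.0 = `b` wins in both,
    0.5 = the verdict flipped with position, so the pair is a tie.
    """
    a_wins_shown_first = judge(prompt, a, b) == "first"
    a_wins_shown_second = judge(prompt, b, a) == "second"
    return (a_wins_shown_first + a_wins_shown_second) / 2

def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with maximal position bias."""
    return "first"

def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Stub judge with no position bias (it has a length bias instead)."""
    return "first" if len(first) >= len(second) else "second"

biased = debiased_preference(always_first, "q", "short", "a much longer answer")
unbiased = debiased_preference(prefers_longer, "q", "short", "a much longer answer")
print(biased, unbiased)
```

The purely position-driven judge collapses to a tie once the orders are averaged, which is the behavior you want: its verdicts carried no information about the answers. The cost is two judge calls per pair.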

Confusing pass rate with quality. On a 50-row suite each row is worth two points, so a move from 96% to 94% means a single row started failing, which is either a real regression or a one-judge-flake away from being recovered. Either grow the suite past 200 rows or report bootstrapped confidence intervals on every score so you can tell the difference between signal and noise.
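A percentile bootstrap is enough to see how wide the interval is on a small suite. This is a sketch using only the standard library; the `bootstrap_ci` name, the 2,000 resamples, and the fixed seed are assumptions chosen for reproducibility, not a recommendation from any framework.

```python
import random

def bootstrap_ci(outcomes: list[bool], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for a pass rate.

    Resamples the per-row outcomes with replacement and reads the
    alpha/2 and 1 - alpha/2 percentiles off the resampled pass rates.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical 50-row suite with 47 passing rows (94% point estimate).
outcomes = [True] * 47 + [False] * 3
lo, hi = bootstrap_ci(outcomes)
print(f"pass rate 0.94, 95% CI [{lo:.2f}, {hi:.2f}]")
```

On a suite this small the interval spans several rows' worth of pass rate, so a one-row movement between two runs sits well inside it. Gate on the lower bound, or grow the suite until the interval is narrower than the regressions you care about.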

Eval-set rot. The corpus your system serves changes; the queries users ask change; the categories that mattered six months ago are not the categories that matter today. Refresh the golden set on a quarterly cadence, but archive the old snapshots — long-running trend lines are exactly what makes “is the system getting better” answerable.

Evals as vanity. A suite at 99% with no rows that ever fail is a suite that does no work. The threshold should be uncomfortable — you should sometimes fail it, fix the issue, and re-merge. If the threshold sits comfortably above your scores, either tighten it or add the harder failure modes you’ve been avoiding. A green dashboard that doesn’t correspond to a green product is the worst-case outcome.

Per-category scoring beats the aggregate. A suite at 85% with 95% pass on routine queries and 30% on adversarial queries is a different product from a suite at 85% with 85% across the board. The first is safe for the routine path and unsafe for the long tail; the second is uniformly mediocre. The aggregate doesn’t tell you which one you have. Always slice by error-analysis category.
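To make the two-suites comparison concrete, here is a sketch with hypothetical row counts picked so that both suites land on the same 85% aggregate. The helper names and the 0.80 per-category floor are illustrative assumptions.

```python
# Each suite maps category -> (rows passed, rows total). Counts are hypothetical.
long_tail_risk = {"routine": (209, 220), "adversarial": (12, 40)}  # 95% / 30%
uniform = {"routine": (187, 220), "adversarial": (34, 40)}         # 85% / 85%

def aggregate(suite: dict[str, tuple[int, int]]) -> float:
    """Overall pass rate across every row in the suite."""
    passed = sum(p for p, _ in suite.values())
    total = sum(t for _, t in suite.values())
    return passed / total

def below_floor(suite: dict[str, tuple[int, int]], floor: float = 0.80) -> list[str]:
    """Categories whose own pass rate falls under the floor."""
    return sorted(cat for cat, (p, t) in suite.items() if p / t < floor)

for name, suite in [("long_tail_risk", long_tail_risk), ("uniform", uniform)]:
    print(name, f"aggregate={aggregate(suite):.2f}", "below floor:", below_floor(suite))
```

Both suites report the same aggregate, and only the per-category floor tells them apart. Gating on the floor as well as the aggregate keeps a large routine category from masking a failing small one.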

Don’t outsource the early loop. The first 100 traces you review are where you learn what your system actually does. Outsourcing this to a vendor — or to an LLM — loses the design signal that the error-analysis pass is for. Hamel makes the case explicitly: an internal domain expert as the final judge, owning the rubric, is what keeps the suite honest. LLM-assistance is fine once the categories are stable; it’s a poor substitute during discovery.

What to read next

LLM-as-Judge: Pointwise and Pairwise — the deep dive on the top tier of this pyramid: rubric design, pointwise vs pairwise modes, the four biases (position, verbosity, self-preference, length) and their mitigations, and the human-calibration loop that makes the judge’s verdict mean something.
Production Tracing and Observability for LLM Systems — the online counterpart to the offline eval suite. Span shape, OpenTelemetry GenAI conventions, sampling and PII policies, and the platform decision across LangSmith, Langfuse, Phoenix, Datadog, and Honeycomb. Captures what happened per turn; the eval suite captures what happens on average.
Drift Detection and Regression Testing for LLM Systems — the control loop that sits across the offline suite and the online trace store. Input drift, output drift, concept drift; the paired-bootstrap protocol for shipping a model upgrade safely. The dynamic counterpart to the static eval gate this article builds.
Human-in-the-Loop Feedback Loops for LLM Systems — the production loop that turns user feedback and reviewer annotations into new eval rows. The error-analysis-first workflow this article opens is what that piece operationalises at steady state — the closing of the loop the suite needs to keep growing.