$ cat ai-engineering/drift-detection.md

Drift Detection and Regression Testing for LLM Systems

Detecting input and output distribution shift in LLM apps, plus the regression-testing protocol for model upgrades: shadow runs, canaries, judge replays.

Jatin Bansal@blog:~/ai-engineering$ open drift-detection

A team pins the production assistant at claude-sonnet-4-6, watches the eval-suite sit at 91% pass rate for six weeks, and ships nothing. The graph is flat. The dashboards are green. Then a Tuesday support ticket reports that the assistant has been describing the company’s flagship product with a competitor’s feature for a week, and a closer look at production traces shows the user-message mix shifted hard after a marketing push — pricing questions used to be 4% of traffic and are now 31%, the system prompt was tuned for the old mix, and the judged pass rate on the new dominant category is 62%. The eval suite never caught it because the golden set was frozen against the old mix; the dashboard never caught it because the aggregate rolled the new failures into a noise floor that hadn’t been recomputed. The system regressed without the code or the model changing. This is what drift looks like in an LLM app, and the discipline that surfaces it before the support escalation is the third leg of the evaluation stool — the one that ties the offline eval suite and the online trace store together into a closed loop.

Opening bridge

Yesterday’s piece on production tracing closed the observability layer: span shape, OTel GenAI conventions, sampling, the platform decision across LangSmith, Langfuse, Phoenix, Datadog, and Honeycomb. The trace store is the substrate — what happened, per turn, queryable. The eval-driven development article closed the offline counterpart: the golden set, the test pyramid, the LLM-as-judge at the top of it. Today’s piece is the control system that sits across both. Traces capture what happened; evals capture what should happen on average; drift detection watches the difference and the trend, and the regression-testing protocol is what you run before you ship a new model into the loop. Without this layer, the trace store is forensic and the eval suite is decorative — they tell you about the past and the snapshot, but they don’t keep production honest week over week.

Definition

Drift detection for LLM systems is the discipline of monitoring three distributions over time — the input distribution (what users send), the output distribution (what the model returns), and the eval-score distribution (how the suite rates the output) — and alerting when any of them diverges materially from a fixed reference. Three properties separate it from “we have monitoring.” First, the reference is explicit and versioned — a reference_window of production traces (typically the last stable 7–30 days) with computed centroids, score distributions, and category mixes, against which the current window is statistically compared. Second, the alert is multi-axis — a single quality number rolling down is the symptom, but the actionable signal is the conjunction of which input cluster grew, which output property changed, and which eval category regressed. Third, the response is bounded — drift is a hypothesis, not a verdict; the operational reaction is to open the trace store on the drifted slice, run a targeted judged re-score, and either bless the new distribution (refresh the reference) or roll back the change that introduced it.

Regression testing for LLM systems is the discipline of running a candidate change — model upgrade, prompt edit, retrieval reconfiguration — against a fixed golden set and a replay of recent production traffic, comparing results against the production baseline with paired statistical tests, and shipping only when both pass. Three properties. First, the comparison is paired — every input is scored on both the candidate and the baseline, and the test statistic is the per-row delta, not the marginal scores. Second, the significance threshold beats the noise floor — bootstrapped confidence intervals over per-category deltas, not point estimates over aggregates, because the judge has a measurable per-sample standard deviation and ignoring it is how teams ship regressions disguised as improvements. Third, the rollout is gradual — shadow first (compare outputs without serving them), canary second (1%, 5%, 20%, 50%, 100% traffic with monitoring at each stop), full cutover last, with a dual-run window where the baseline is still warm enough to roll back to.

The two disciplines compose. Regression testing protects the deploy boundary; drift detection protects the steady-state boundary. Skipping either is how systems go quietly bad: a perfect deploy that drifts over the next two months, or a tight steady-state monitor that lets a regression in on the first deploy of a new model.

Intuition

The mental model: an LLM application looks like a stationary system, but its four moving parts (input distribution, prompt, model weights, retrieval state) can each shift without notice, and any single shift can move the output distribution enough to fail your users. Each moving part has its own observable shadow. The input distribution shadow is the embedding cluster of user messages, sampled per hour and compared against a rolling baseline. The prompt shadow is the version stamp on every span: a prompt edit lands in the trace store at a specific commit SHA, and any drift after that point is colored by that edit. The model-weights shadow is the provider’s version string and the judge-score trend over a fixed eval slice; when the provider quietly upgrades the model, the judge sees it even if the version string didn’t change. The retrieval shadow is the recall@k on a fixed query set against the current index.

The reason drift is the third leg of evaluation, not the first or second, is that it requires both substrates to be in place. Without traces, you have no production-side distribution to monitor. Without the eval suite, you have no fixed reference to compare current quality against. Drift detection eats both as inputs — it pulls the current window from the trace store, the reference window from the trace store or eval log, and the per-row judged scores from the eval pipeline. Build it after both of those, not before.

The right unit of analysis is a category-conditional drift signal, not an aggregate. An aggregate that drifts by 3 points hides a single category dropping by 15 points; the per-category breakdown surfaces the failure mode and points at which trace cluster is the source. The same logic from the eval-driven development article applies — slice by error-analysis category every time — but at the monitoring layer the slicing is per-cluster instead of per-rubric. The two slicings converge in practice: an input cluster usually maps to a small number of eval-suite categories, and an output drift on a category usually maps to one or two input clusters that started behaving differently.
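
To make the arithmetic concrete, here is a small sketch with made-up numbers: one category holding 20% of traffic drops 15 points while the rest hold steady. The category names, traffic shares, and pass rates are illustrative assumptions, not data from a real system.

```python
# Illustrative numbers only: (traffic share, pass rate) per category.
reference = {
    "pricing": (0.20, 0.90),
    "order_status": (0.30, 0.92),
    "returns": (0.20, 0.91),
    "account": (0.20, 0.90),
    "other": (0.10, 0.92),
}
# Same traffic mix, but "pricing" drops 15 points.
current = dict(reference, pricing=(0.20, 0.75))


def aggregate(mix: dict[str, tuple[float, float]]) -> float:
    """Traffic-weighted pass rate across categories."""
    return sum(share * rate for share, rate in mix.values())


def per_category_delta(ref: dict, cur: dict) -> dict[str, float]:
    """Pass-rate change per category (current minus reference)."""
    return {cat: cur[cat][1] - ref[cat][1] for cat in ref}


agg_delta = aggregate(current) - aggregate(reference)
deltas = per_category_delta(reference, current)
worst = min(deltas, key=deltas.get)
print(f"aggregate delta: {agg_delta:+.3f}")
print(f"worst category: {worst} ({deltas[worst]:+.3f})")
```

The aggregate moves by the category's drop times its traffic share (0.20 × 0.15 = 0.03), which is easy to dismiss as noise unless the per-category view is also on the dashboard.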

The distributed-systems parallel

The closest analogue is a control loop over a stationary process. SRE teaches the discipline: define the SLI, the SLO, and the error budget; instrument the system; detect deviations from the SLO; alert when the error budget burns past a threshold; investigate; fix or accept. LLM drift detection is the same shape, with two adjustments for the new substrate. First, the SLI is statistical, not deterministic. The microservice SLI “p95 latency < 300ms” is a hard number with no noise floor worth modeling; the LLM SLI “judge-rated faithfulness ≥ 0.85” carries a confidence interval that has to be modeled or the alert noise is unmanageable. Second, the inputs are not stationary by design. A microservice expects its requests to be roughly stationary modulo seasonality; an LLM app expects its requests to drift constantly as the user mix changes, the marketing team writes new copy, and the seasons turn. The drift detector is for the non-stationary regime — surfacing the shifts that matter and ignoring the ones that don’t.

The deeper parallel is canary deploys and progressive delivery, applied to a stochastic system. The microservice canary uses request-level metrics and rolls back at the first SLO breach. The LLM canary can’t do that — single-request quality is noisy by construction, and a single bad output is not signal. The instrumented protocol replaces request-level rollback with window-level rollback: the canary serves 1% of traffic, the judge re-scores a sample of the canary’s traces every hour, the per-category quality delta is computed over the window, and the rollback gates on a confidence-interval-aware significance threshold against the baseline window. Same shape, different statistic.

A real disanalogy worth flagging. Microservices have a stable mapping from input → output that the canary can rely on — same request goes to two versions, the outputs are deterministically comparable. LLM outputs are sampled from a distribution; the same input to the same model returns different outputs across calls. The shadow-mode comparison doesn’t compare outputs directly; it compares the judged scores of outputs, and even that comparison needs paired statistics over a sample, not a point estimate. This is why every serious LLM shadow-mode framework — Phoenix, Langfuse comparisons, Promptfoo’s matrix-of-providers — runs the comparison through a judge, not a string diff.

Mechanics: the three axes of drift

The signals factor into three axes. Detecting any one without the others is half-blind.

Input drift. What users are sending. The cheap shadow is category mix: every user message is classified into an error-analysis category at ingest, the per-category count is logged, and a Population Stability Index (PSI) between the current 24-hour window and the reference 7-day window is computed once an hour. PSI above 0.2 flags material drift in the literature, above 0.1 is worth a glance. The expensive shadow is embedding centroid distance: the text-embedding of every user message goes into the trace store, the current-window centroid is the mean of those embeddings, and the Euclidean (or cosine) distance to the reference-window centroid is the drift signal. Arize Phoenix’s embedding-drift module is the canonical implementation — it clusters with HDBSCAN, projects with UMAP, and exposes the centroid distance as a per-cluster trend line. The dual signal — category PSI plus centroid distance — catches both the “users are asking new things” failure mode (PSI fires; centroid moves) and the “users are asking similar things differently” mode (PSI flat; centroid moves) that a single signal would miss.

Output drift. What the model is returning. The shadow at this layer is per-output features that don’t require judging — output length distribution, refusal rate, tool-call frequency per turn, structured-output validity rate, cache hit rate. These are cheap to compute over every span, and the Kolmogorov-Smirnov test on the per-window distribution against the reference distribution is the alerting signal. A 10-point drop in tool-call frequency is the kind of failure that doesn’t show in any quality metric until support escalates it, but shows immediately in the KS test on tool-use rate. Output drift is the easiest axis to instrument and the easiest to miss because none of these signals are quality measurements — they’re correlates of quality, and a system that monitors only the judged quality score will see the regression two days after the cheap output-feature signal would have fired.
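
A minimal sketch of the output-feature check in dependency-free Python. The function names and sample values are hypothetical; in production you would more likely call `scipy.stats.ks_2samp`, which also returns a p-value. Note that a boolean feature such as tool use or refusal reduces to a rate, so this sketch compares rates directly for those and reserves the KS statistic for continuous features like length.

```python
from bisect import bisect_right


def ks_statistic(reference: list[float], current: list[float]) -> float:
    """Two-sample KS D: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    d = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in set(ref) | set(cur):
        ref_cdf = bisect_right(ref, x) / len(ref)
        cur_cdf = bisect_right(cur, x) / len(cur)
        d = max(d, abs(ref_cdf - cur_cdf))
    return d


def rate(flags: list[bool]) -> float:
    """Share of spans where a boolean feature (refusal, tool call) is set."""
    return sum(flags) / len(flags)


# Hypothetical per-span features for a reference and a current window.
ref_lengths = [120, 135, 150, 160, 170, 180, 200, 210]
cur_lengths = [220, 240, 250, 260, 280, 300, 310, 330]
ref_tool_calls = [True] * 6 + [False] * 4
cur_tool_calls = [True] * 3 + [False] * 7

print("output length D:", ks_statistic(ref_lengths, cur_lengths))
print("tool-call rate change:", rate(cur_tool_calls) - rate(ref_tool_calls))
```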

Concept drift. Whether the relationship between input and output is what it used to be, which is the actual quality measurement. The shadow at this layer is the judged score on a fixed evaluation slice, run continuously over a rolling sample of production traces. Sample 1% of traces per category per hour, run the pinned judge against them, compute the per-category mean score and bootstrapped confidence interval, and alert when the current-window CI upper bound falls below the reference-window CI lower bound, that is, when the two intervals no longer overlap and the current one sits lower. This is the most expensive signal (every run costs you the judge-call price for the sampled traces, times the categories, times the windows), but it’s the only one that measures the thing you care about. The cheaper signals on the other two axes are correlates; this one is the target.
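
The interval comparison can be sketched as follows, assuming a percentile bootstrap on the mean judge score and an alert only when the current interval sits entirely below the reference interval. The function names, the iteration count, and the sample scores are illustrative assumptions.

```python
import random


def bootstrap_mean_ci(scores, iterations=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI on the mean of a list of judge scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(iterations))
    lower = means[int(iterations * alpha / 2)]
    upper = means[min(iterations - 1, int(iterations * (1 - alpha / 2)))]
    return lower, upper


def concept_drift_alert(reference_scores, current_scores) -> bool:
    """True only when the current CI lies entirely below the reference CI."""
    ref_lower, _ = bootstrap_mean_ci(reference_scores, seed=0)
    _, cur_upper = bootstrap_mean_ci(current_scores, seed=1)
    return cur_upper < ref_lower


# Hypothetical judged scores (1-3 scale) for one category.
reference = [3] * 80 + [2] * 15 + [1] * 5
degraded = [3] * 40 + [2] * 30 + [1] * 30

print(concept_drift_alert(reference, degraded))
print(concept_drift_alert(reference, reference))
```

Requiring non-overlapping intervals is a conservative gate: it trades some sensitivity for fewer false alarms, which suits a signal whose alerts end in a human page.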

The discipline is to gate the expensive signal on the cheap ones. The hourly judge-replay runs against full samples; the category-PSI and centroid-distance checks run continuously and, when they fire, trigger an immediate judge re-score on the drifted slice. The combination — continuous cheap monitoring plus on-demand expensive judging — keeps the bill bounded while keeping the latency to alert under an hour for the drift modes that matter.

Mechanics: statistical tools that transfer

A handful of statistical tools cover the LLM drift surface; pick by data shape.

Population Stability Index (PSI) for categorical distributions — error-analysis category mix, tool-call distribution, finish-reason distribution. PSI is Σ (current_pct - reference_pct) * ln(current_pct / reference_pct) over the bins. Material drift threshold ~0.2, alert threshold ~0.1, fine-grained inspection above 0.05. Cheap; runs on aggregates.
Kolmogorov-Smirnov (KS) test for continuous distributions — output length, latency, per-span cost, judge score. KS reports the max absolute difference between the two empirical CDFs and a p-value; the p-value is the alert signal but the effect size (the D statistic) is what matters at scale. Cheap; runs on samples.
Kullback-Leibler (KL) divergence for probability distributions where you have explicit probabilities — token-level distributions if you have logprobs, soft-label outputs from a classifier head. KL is Σ p(i) * log(p(i) / q(i)); it’s asymmetric, and the symmetric variant (Jensen-Shannon) is the more useful drift metric in practice. Cheaper than KS; needs probabilities.
Embedding centroid distance. Euclidean or cosine distance between current-window and reference-window mean embeddings. The bare metric is the cheap version; cluster-aware variants (compute per-cluster centroids after HDBSCAN, alert when any cluster’s centroid moves or a new cluster appears with mass above a threshold) are the production version. Arize’s documented implementation is the cleanest reference.
Paired-bootstrap confidence intervals for regression testing: the right tool when you’re comparing two systems on the same fixed inputs. Resample the golden-set rows with replacement (keeping each row’s baseline and candidate scores together), recompute the mean per-row delta on each bootstrap sample, and take the 2.5th and 97.5th percentiles of that distribution as the 95% CI on the delta. A CI that excludes zero is the significance gate. Bootstrap because the per-row scores aren’t normally distributed and t-tests can overstate confidence.
Domain-classifier drift. Evidently’s approach for raw text: train a binary classifier to distinguish reference-window text from current-window text; if the classifier hits ROC AUC > ~0.7, the distributions are separable and drift is happening. Useful when the input doesn’t reduce cleanly to categories or embeddings (e.g. raw markdown documents, long-form code).

Pick the tool by the data shape, not by what’s fashionable. Don’t reach for KL divergence on a categorical mix where PSI is the bog-standard answer, and don’t run a KS test on a 7-bin distribution where PSI is more interpretable.
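
For the case where explicit probabilities are available, a minimal sketch of KL and its symmetric Jensen-Shannon variant in plain Python. The soft-label vectors are hypothetical, and the function names are illustrative rather than taken from any library.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; terms with p[i] == 0 contribute zero.

    Raises ZeroDivisionError if q[i] == 0 where p[i] > 0, the case in
    which KL is infinite.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded above by ln(2)."""
    # The mixture m is positive wherever p or q is, so both KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical window-averaged soft labels from a classifier head:
# [positive, neutral, negative].
reference = [0.80, 0.05, 0.15]
current = [0.60, 0.25, 0.15]

print("KL(ref || cur):", kl_divergence(reference, current))
print("KL(cur || ref):", kl_divergence(current, reference))
print("JS:", js_divergence(reference, current))
```

The two KL values differ, which is the asymmetry that makes raw KL awkward as an alerting metric; the JS value is the same in either direction.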

Mechanics: the regression-testing protocol for model upgrades

A model upgrade — Sonnet 4.6 to Opus 4.7, Haiku 4.5 to whatever Anthropic ships next — is the most common shipping event that breaks an LLM app, because the provider’s pre-release evals tell you about their benchmarks, not yours. Anthropic’s recent deprecation cadence — Sonnet 4 and Opus 4 retired June 15, 2026; Sonnet 4.5 cutover by May 18 — means most production teams will run this protocol every few months for the lifetime of the application. The protocol that holds up has four phases.

Phase 1: offline regression on the golden set. Run the golden set against both the baseline model and the candidate. Score every row with the same pinned judge (the LLM-as-judge article is load-bearing here — judge calibration changes invalidate the comparison). Compute paired-bootstrap CIs on the per-category quality delta. The gate: every category’s delta CI either contains zero (no regression) or is strictly above zero (improvement); any category whose CI is strictly negative is a regression and the candidate fails the gate. Aggregate-only gating misses category-level regressions; gate per category.

Phase 2: shadow run on historical production traffic. Pull the last 7 days of production traces, replay the inputs through the candidate without serving the outputs, and run the same judge against both baseline outputs (already scored, in the trace store) and candidate outputs. The signal here is two-fold: the same paired-bootstrap CI on per-category deltas (now over real production distribution, not the curated golden set), and a behavioral diff — output length, tool-call frequency, refusal rate, structured-output validity — that catches behavioral regressions the judge isn’t sensitive to. A model that emits 40% more tokens to say the same thing is a cost regression even if the quality is identical, and the behavioral diff is what surfaces it.
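
The behavioral diff can be sketched as below. The `Output` fields, the function names, and both tolerances are assumptions chosen for illustration; the sketch also assumes non-empty windows with a non-zero baseline token count.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """One replayed response; field names are illustrative."""
    tokens: int
    tool_calls: int
    refused: bool
    valid_json: bool


def behavior_summary(outputs: list[Output]) -> dict[str, float]:
    """Judge-free features averaged over a window of outputs."""
    n = len(outputs)
    return {
        "mean_tokens": sum(o.tokens for o in outputs) / n,
        "tool_calls_per_turn": sum(o.tool_calls for o in outputs) / n,
        "refusal_rate": sum(o.refused for o in outputs) / n,
        "valid_json_rate": sum(o.valid_json for o in outputs) / n,
    }


def behavioral_diff(baseline, candidate, max_token_growth=0.20, max_shift=0.05):
    """Return human-readable flags for features that moved past tolerance."""
    base, cand = behavior_summary(baseline), behavior_summary(candidate)
    flags = []
    token_growth = cand["mean_tokens"] / base["mean_tokens"] - 1.0
    if token_growth > max_token_growth:
        flags.append(f"mean_tokens +{token_growth:.0%}")
    for key in ("tool_calls_per_turn", "refusal_rate", "valid_json_rate"):
        shift = cand[key] - base[key]
        if abs(shift) > max_shift:
            flags.append(f"{key} {shift:+.2f}")
    return flags


# Hypothetical replay: same behavior, but the candidate is 40% more verbose.
baseline = [Output(100, 1, False, True) for _ in range(4)]
candidate = [Output(140, 1, False, True) for _ in range(4)]
print(behavioral_diff(baseline, candidate))
```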

Phase 3: canary deploy. Route 1% of live traffic to the candidate, monitor for an hour, judge a 10% sample, gate on the same per-category CI test. If it passes, step to 5%, monitor for an hour, judge. Continue to 20%, 50%, 100% with the same checkpoint at each stop. The increments and dwell times are the progressive-delivery primitives borrowed verbatim from microservices; the only LLM-specific adjustment is that the per-window judged sample is the gate, not request-level quality. A canary that fails its gate should roll back by flipping a single config flag, which is why the routing has to be a config knob, not a deploy step.
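
The stepping logic is small enough to sketch. Both callables below are hypothetical hooks, not part of any real deployment tool: `set_traffic_percent` stands in for the routing config knob, and `gate_passes` stands in for "wait out the dwell time, judge the sample, run the per-category CI test".

```python
from typing import Callable

STAGES = [1, 5, 20, 50, 100]  # percent of live traffic at each stop


def run_canary(
    set_traffic_percent: Callable[[int], None],
    gate_passes: Callable[[int], bool],
) -> bool:
    """Walk the stages in order; roll back to 0% on the first failed gate."""
    for percent in STAGES:
        set_traffic_percent(percent)
        if not gate_passes(percent):
            set_traffic_percent(0)  # rollback is a config flip, not a deploy
            return False
    return True


# Usage with stub hooks: the gate fails once the canary reaches 20%.
history: list[int] = []
promoted = run_canary(history.append, lambda percent: percent < 20)
print(promoted, history)
```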

Phase 4: dual-run window and decommission. After cutover, keep the baseline warm for at least a week — same prompt, same routing infrastructure, just not the default route. The dual-run window catches the slower drift modes that the canary missed: user behavior changing in response to the new model’s verbosity, downstream consumers depending on a specific output shape, cache patterns that the canary’s short window didn’t exercise. After the dual-run window passes clean, decommission the baseline and reclaim the resources; the provider’s pricing often makes the dual-run window a small cost relative to the rollback insurance it buys.

A subtler point. The protocol works for model upgrades and prompt edits with no modification — both are changes to the per-call surface and both fail in the same way. It does not directly cover retrieval changes (a new chunking strategy, a re-embedded vector store, a reranker swap) because the change affects the context the model sees and the offline golden set’s recall metrics matter more than the judged output. The RAG evaluation article is the right specialization there; the protocol structure is the same, the metrics are different. Similarly, embedding-model upgrades need the dual-index migration pattern; the per-call regression protocol catches the symptom but not the cause.

Code: an embedding-drift detector over a trace store in Python

The detector below pulls user messages from the Langfuse trace store, embeds them, computes PSI over the categorical mix, and centroid-distance over the embeddings, and emits a structured drift report. Install: pip install langfuse openai numpy scipy.

python

# pip install langfuse openai numpy scipy
import os
from datetime import datetime, timedelta, timezone
from collections import Counter

import numpy as np
from langfuse import get_client
from openai import OpenAI
from scipy.stats import ks_2samp

langfuse = get_client()
openai_client = OpenAI()
EMBED_MODEL = "text-embedding-3-large"


def fetch_user_messages(start: datetime, end: datetime, limit: int = 5000):
    """Pull user-turn inputs from the trace store as (text, category, output_len) tuples."""
    traces = langfuse.api.trace.list(
        from_timestamp=start, to_timestamp=end, limit=limit,
    ).data
    rows = []
    for t in traces:
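        # Trace input is free-form JSON; only dict payloads carry "user_msg".
        if not isinstance(t.input, dict):
            continue
        text = str(t.input.get("user_msg", "")).strip()
        if not text:
            continue  # the embeddings endpoint rejects empty strings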
        category = (t.metadata or {}).get("category", "unknown")
        output_len = len(str(t.output or ""))
        rows.append((text, category, output_len))
    return rows


def psi(reference: list, current: list, bins: int | None = None) -> float:
    """Population Stability Index over categorical mixes.

    For a categorical input (list of labels), bins is the union of labels.
    For a continuous input, pass bins as an integer and we histogram.
    """
    if bins is None:
        # Categorical: bins are the union of unique values.
        labels = sorted(set(reference) | set(current))
        ref_counts = Counter(reference)
        cur_counts = Counter(current)
        ref_total = max(1, len(reference))
        cur_total = max(1, len(current))
        psi_total = 0.0
        for label in labels:
            ref_pct = (ref_counts[label] + 1e-6) / ref_total  # smoothing
            cur_pct = (cur_counts[label] + 1e-6) / cur_total
            psi_total += (cur_pct - ref_pct) * np.log(cur_pct / ref_pct)
        return float(psi_total)
    # Continuous: equal-width bins over the reference range.
    edges = np.linspace(min(reference), max(reference), bins + 1)
    ref_hist, _ = np.histogram(reference, bins=edges)
    cur_hist, _ = np.histogram(current, bins=edges)
    ref_pct = (ref_hist + 1e-6) / max(1, ref_hist.sum())
    cur_pct = (cur_hist + 1e-6) / max(1, cur_hist.sum())
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


def embed_batch(texts: list[str]) -> np.ndarray:
    """Batch-embed texts; production should retry/backoff."""
    resp = openai_client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([e.embedding for e in resp.data])


def centroid_distance(ref_emb: np.ndarray, cur_emb: np.ndarray) -> float:
    """Cosine distance between the two centroids."""
    ref_c = ref_emb.mean(axis=0)
    cur_c = cur_emb.mean(axis=0)
    cos = np.dot(ref_c, cur_c) / (np.linalg.norm(ref_c) * np.linalg.norm(cur_c))
    return float(1.0 - cos)


def run_drift_check(reference_days: int = 7, current_hours: int = 24) -> dict:
    end = datetime.now(timezone.utc)
    ref_start = end - timedelta(days=reference_days) - timedelta(hours=current_hours)
    ref_end = end - timedelta(hours=current_hours)
    ref = fetch_user_messages(ref_start, ref_end)
    cur = fetch_user_messages(ref_end, end)
    if not ref or not cur:
        return {"status": "insufficient_data", "ref_n": len(ref), "cur_n": len(cur)}

    # Category PSI
    cat_psi = psi([r[1] for r in ref], [c[1] for c in cur])

    # Output-length KS
    ks_stat, ks_p = ks_2samp([r[2] for r in ref], [c[2] for c in cur])

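    # Embedding centroid distance: random subsample to bound cost.
    # (A head slice would be biased toward whatever order the API returns.)
    rng = np.random.default_rng(0)
    ref_idx = rng.choice(len(ref), size=min(500, len(ref)), replace=False)
    cur_idx = rng.choice(len(cur), size=min(500, len(cur)), replace=False)
    ref_sample = [ref[i][0] for i in ref_idx]
    cur_sample = [cur[i][0] for i in cur_idx]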
    ref_emb = embed_batch(ref_sample)
    cur_emb = embed_batch(cur_sample)
    cent_dist = centroid_distance(ref_emb, cur_emb)

    # Decision: any single signal can fire, but the action is "investigate the slice".
    alerts = []
    if cat_psi > 0.2:
        alerts.append(f"category_psi={cat_psi:.3f} (material)")
    elif cat_psi > 0.1:
        alerts.append(f"category_psi={cat_psi:.3f} (worth_glance)")
    if ks_p < 0.01 and ks_stat > 0.1:
        alerts.append(f"output_length_ks_D={ks_stat:.3f}, p={ks_p:.4f}")
    if cent_dist > 0.05:
        alerts.append(f"embedding_centroid_distance={cent_dist:.4f}")

    return {
        "status": "drift" if alerts else "stable",
        "alerts": alerts,
        "category_psi": cat_psi,
        "output_length_ks_stat": float(ks_stat),
        "output_length_ks_p": float(ks_p),
        "embedding_centroid_distance": cent_dist,
        "n_reference": len(ref),
        "n_current": len(cur),
    }


if __name__ == "__main__":
    import json
    print(json.dumps(run_drift_check(), indent=2))

Three things to flag. First, the three signals (category PSI, output-length KS, embedding-centroid distance) are evaluated independently. Any one firing is enough to investigate; the conjunction is the strong signal. Second, the thresholds are starting points (PSI 0.2 / 0.1 from the scorecard literature, KS p<0.01 and D>0.1 to filter weak effects); recalibrate against your own false-positive rate after a month. Third, the embedding step is sampled to 500 messages per window. At $0.13 per million tokens for text-embedding-3-large and ~80 tokens per message, each run embeds about 80,000 tokens (two windows of 500 messages) for roughly $0.01, so an hourly run costs around $0.25 a day, which is the cheap shadow signal. The expensive judged signal sits behind this and runs only when an alert fires.

Code: a paired-bootstrap regression gate in TypeScript

The harness below runs a candidate model and a baseline model against the same golden set, scores every row with a pinned judge, and emits a per-category bootstrap CI on the quality delta. Install: npm install ai @ai-sdk/anthropic @ai-sdk/openai zod.

typescript

// npm install ai @ai-sdk/anthropic @ai-sdk/openai zod
import { generateText, generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

interface GoldenRow {
  id: string;
  category: string;
  input: string;
  expected?: string; // optional reference output
}

interface ScoredRow {
  id: string;
  category: string;
  baseline_score: number;
  candidate_score: number;
  delta: number;
}

const JUDGE = anthropic("claude-haiku-4-5"); // pinned

const VerdictSchema = z.object({
  reasoning: z.string(),
  score: z.number().int().min(1).max(3),
});

const RUBRIC = `Score the response 1-3 against the rubric (faithfulness + helpfulness against the input). Reason step by step before the score.`;

async function judge(input: string, output: string): Promise<number> {
  const { object } = await generateObject({
    model: JUDGE,
    schema: VerdictSchema,
    prompt: `${RUBRIC}\n\nInput:\n${input}\n\nOutput:\n${output}`,
  });
  return object.score;
}

async function runSystem(
  model: ReturnType<typeof anthropic>,
  input: string,
): Promise<string> {
  const { text } = await generateText({
    model,
    messages: [{ role: "user", content: input }],
    // The system prompt and tool surface should match production exactly;
    // omitted here for brevity.
  });
  return text;
}

async function scoreSuite(
  rows: GoldenRow[],
  baseline: ReturnType<typeof anthropic>,
  candidate: ReturnType<typeof anthropic>,
): Promise<ScoredRow[]> {
  const scored: ScoredRow[] = [];
  for (const row of rows) {
    const [baseOut, candOut] = await Promise.all([
      runSystem(baseline, row.input),
      runSystem(candidate, row.input),
    ]);
    const [baseScore, candScore] = await Promise.all([
      judge(row.input, baseOut),
      judge(row.input, candOut),
    ]);
    scored.push({
      id: row.id,
      category: row.category,
      baseline_score: baseScore,
      candidate_score: candScore,
      delta: candScore - baseScore,
    });
  }
  return scored;
}

function bootstrapCI(
  deltas: number[],
  iterations = 5000,
  alpha = 0.05,
): { mean: number; lower: number; upper: number } {
  if (deltas.length === 0) return { mean: 0, lower: 0, upper: 0 };
  const sampleMean = (xs: number[]): number =>
    xs.reduce((a, b) => a + b, 0) / xs.length;
  const means: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const sample = Array.from(
      { length: deltas.length },
      () => deltas[Math.floor(Math.random() * deltas.length)],
    );
    means.push(sampleMean(sample));
  }
  means.sort((a, b) => a - b);
  const lo = means[Math.floor(iterations * (alpha / 2))];
  const hi = means[Math.floor(iterations * (1 - alpha / 2))];
  return { mean: sampleMean(deltas), lower: lo, upper: hi };
}

function gateByCategory(scored: ScoredRow[]): {
  passed: boolean;
  per_category: Record<string, { mean: number; lower: number; upper: number; n: number; verdict: string }>;
} {
  const byCat: Record<string, number[]> = {};
  for (const r of scored) {
    (byCat[r.category] ||= []).push(r.delta);
  }
  const per_category: Record<string, { mean: number; lower: number; upper: number; n: number; verdict: string }> = {};
  let passed = true;
  for (const [cat, deltas] of Object.entries(byCat)) {
    const ci = bootstrapCI(deltas);
    let verdict: string;
    if (ci.upper < 0) {
      verdict = "regression"; // CI strictly below zero
      passed = false;
    } else if (ci.lower > 0) {
      verdict = "improvement";
    } else {
      verdict = "no_change";
    }
    per_category[cat] = { ...ci, n: deltas.length, verdict };
  }
  return { passed, per_category };
}

async function main() {
  const goldenSet: GoldenRow[] = [
    // 50-500 rows curated from error-analysis categories
    { id: "1", category: "order_status", input: "Where is order 12345?" },
    // ...
  ];
  const baseline = anthropic("claude-sonnet-4-5");
  const candidate = anthropic("claude-sonnet-4-6");
  const scored = await scoreSuite(goldenSet, baseline, candidate);
  const gate = gateByCategory(scored);
  console.log(JSON.stringify(gate, null, 2));
  if (!gate.passed) {
    console.error("Regression detected — do not promote.");
    process.exit(1);
  }
}

main().catch((e) => {
  console.error(e);
  process.exit(1);
});

Three things to flag. First, the gate is category-conditional — any single category whose 95% CI is strictly below zero is a regression, even if the aggregate looks fine. Second, the bootstrap uses paired deltas, not marginal scores, because the same input has correlated noise across the two systems and ignoring the pairing inflates the CI. Third, the judge is pinned (claude-haiku-4-5) — the protocol explicitly does not change the judge model when comparing the two systems, because judge calibration drift would corrupt the comparison.

A common extension is to compute the same gate for cost and latency deltas. A candidate that improves quality by 0.05 and triples latency is a different shipping decision than one that improves quality by 0.05 at constant latency; the gate should expose both Pareto dimensions, not just the quality axis.

Tooling for steady-state drift detection

Three platforms anchor the production conversation in 2026; the right one depends on what’s already wired in.

Arize Phoenix — the open-source observability stack with the deepest embedding-drift surface. HDBSCAN clustering, UMAP projection, centroid-distance trend lines, and per-cluster judge-replay are first-class features. The right choice when drift detection is the operational pain point.
Evidently — open-source ML+LLM observability with text-data drift detection (domain-classifier method) and 100+ built-in metrics including PSI, KS, JSD. Heavier on the structured-data side than Phoenix, but the LLM evals product is competitive and the report-generation UX is the cleanest in the space.
Langfuse / LangSmith — the trace stores that anchor the observability article. Both ship score-trend monitoring and alert pipelines; the drift signal here is judged-score-over-time rather than input-distribution shift, which means they see concept drift well and input/output drift only indirectly. Pair with a Phoenix or Evidently component for the cheap shadow signals.

The decision rule: pick the steady-state drift detector by which axis is the operational pain. Input-distribution shift (the marketing-push failure mode) → Phoenix-style embedding analytics. Output-feature shift (the silent-model-upgrade mode) → Evidently-style descriptor monitoring or a custom KS pipeline. Concept drift (the judge-score-trending-down mode) → the trace-store-native score-trend monitor in Langfuse or LangSmith, gated by the judge protocol from the LLM-as-judge article. Most production stacks end up wiring one of each.

Trade-offs, failure modes, gotchas

Don’t alert on every drift signal. Investigate. A drift signal is a hypothesis that something changed; it is not a verdict that quality regressed. The right escalation chain is: cheap signal fires → automated targeted judge re-score on the drifted slice → judge confirms quality regression → a human gets paged. Pipelines that page on the cheap signal directly burn out the on-call within a month and end up muted; pipelines that wait for the judge’s confirmation filter out the false positives first. The cheap signal’s job is to prioritize the expensive judge’s work, not to alert humans directly.
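
The escalation chain can be sketched as a small pure state function. The names (`DriftState`, `nextAction`) and the 0.05 tolerance are hypothetical; the only point is that paging becomes reachable only after a judged score exists for the drifted slice.

```typescript
type Action = "ignore" | "rescore_slice" | "log_only" | "page_human";

interface DriftState {
  cheapSignalFired: boolean; // e.g. a PSI or centroid-distance threshold tripped
  judgedScore?: number; // set once the targeted judge re-score has finished
}

// referenceScore is the judged quality of the reference window for this slice.
// tolerance is an illustrative assumption, not a recommended value.
function nextAction(
  state: DriftState,
  referenceScore: number,
  tolerance = 0.05,
): Action {
  if (!state.cheapSignalFired) return "ignore";
  // The cheap signal never pages on its own; it only queues judge work.
  if (state.judgedScore === undefined) return "rescore_slice";
  return state.judgedScore < referenceScore - tolerance
    ? "page_human"
    : "log_only";
}
```

With the numbers from the opening scenario, a slice judged at 0.62 against a 0.91 reference yields `page_human`, while a drifted slice that still scores 0.90 is only logged.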

Refresh the reference window on a schedule, but freeze it during investigation. The reference window’s job is to be the stable comparison; if it drifts along with current production, the detector silently goes blind. Refresh weekly (or after a known intentional change like a prompt edit lands), but freeze the reference whenever an alert is open so the investigation has a stable baseline. The single biggest source of “we silenced the alert and forgot to fix the bug” is a rolling reference window that absorbs the regression into the new normal.
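
One way to encode the refresh-but-freeze rule, as a sketch: the `shouldRefreshReference` helper and its inputs are hypothetical, and the weekly cadence is taken from the paragraph above.

```typescript
interface ReferenceWindow {
  refreshedAt: Date; // when the reference was last recomputed
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function shouldRefreshReference(
  ref: ReferenceWindow,
  now: Date,
  openAlerts: number, // drift alerts currently under investigation
  intentionalChangeLanded: boolean, // e.g. a prompt edit shipped since the last refresh
): boolean {
  // Freeze first: an open investigation needs a stable baseline.
  if (openAlerts > 0) return false;
  // A known intentional change resets the comparison immediately.
  if (intentionalChangeLanded) return true;
  // Otherwise refresh on the weekly schedule.
  return now.getTime() - ref.refreshedAt.getTime() >= WEEK_MS;
}
```

Checking the freeze condition before anything else is the design choice that matters: it keeps a rolling window from absorbing a regression while the alert is still open.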

Judge calibration drift will look like concept drift. When the judge model is silently upgraded by the provider, the score distribution shifts and the steady-state monitor screams concept drift. The signal is real (the judged score did change), but the cause is the instrument, not the system. The mitigation is to pin the judge model — explicitly version it in the eval config — and to recompute the reference baseline whenever the judge changes, treating the upgrade as a metric reset. The LLM-as-judge article covers this in depth; the operational hook here is that the drift detector should know about judge-version changes and pause alerting through them.

Embedding drift in the monitoring layer is its own failure mode. The embedding model you use to compute centroid distance can itself be upgraded, and a v1 reference centroid against a v2 current centroid will register massive “drift” that is purely the instrument changing. Pin the monitoring embedding model the same way you pin the judge, and treat its upgrade as a metric reset. The memory-conflict-and-forgetting article covers the equivalent migration pattern for retrieval embeddings; the monitoring layer has the same problem at smaller scale.
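
Both instrument-drift cases (judge and monitoring embedder) reduce to the same check: compare the versions the reference was computed with against the versions in use now. A minimal sketch, where the `InstrumentVersions` shape and the embedding model names are placeholders rather than real identifiers:

```typescript
interface InstrumentVersions {
  judgeModel: string; // pinned judge, e.g. "claude-haiku-4-5"
  embeddingModel: string; // pinned monitoring embedder (placeholder names below)
}

type AlertingMode = "normal" | "paused_rebaseline";

// Returns the instruments whose version differs from the reference.
function changedInstruments(
  reference: InstrumentVersions,
  current: InstrumentVersions,
): (keyof InstrumentVersions)[] {
  const keys: (keyof InstrumentVersions)[] = ["judgeModel", "embeddingModel"];
  return keys.filter((k) => reference[k] !== current[k]);
}

// Any instrument change is treated as a metric reset: pause alerting
// and recompute the reference before comparing again.
function alertingMode(
  reference: InstrumentVersions,
  current: InstrumentVersions,
): AlertingMode {
  return changedInstruments(reference, current).length > 0
    ? "paused_rebaseline"
    : "normal";
}
```

Storing the version pair alongside the reference window is what lets the detector tell an instrument change apart from a real shift.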

Per-category sample sizes get tiny fast. A 24-hour window with 1000 traces split across 12 categories is ~83 traces per category on average, which gives wide bootstrap CIs and weak signal per category. Two mitigations: lengthen the window for low-volume categories (compute 7-day windows for tail categories, 24-hour for head categories), and merge adjacent categories where the failure modes are similar. Don’t drop a low-volume category just because its window is too small: a wide bootstrap CI tells you the signal is weak, and the right reaction is to widen the window, not to dismiss the alert.
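
The window-lengthening mitigation can be sketched as a per-category window picker. The `windowDaysFor` helper and the 300-sample floor are assumptions for illustration; the 1-day and 7-day bounds come from the paragraph above.

```typescript
// Pick the shortest window (in whole days) expected to collect at least
// minSamples traces, clamped between 1 day and maxDays.
// minSamples = 300 is an illustrative floor, not a recommendation.
function windowDaysFor(
  tracesPerDay: number,
  minSamples = 300,
  maxDays = 7,
): number {
  if (tracesPerDay <= 0) return maxDays; // no traffic: use the widest window
  const needed = Math.ceil(minSamples / tracesPerDay);
  return Math.min(maxDays, Math.max(1, needed));
}

// 1000 traces/day spread evenly over 12 categories is about 83 per category.
const perCategory = 1000 / 12;
const tailWindow = windowDaysFor(perCategory); // 4
const headWindow = windowDaysFor(600); // 1
```

A category that cannot reach the floor even at seven days is a candidate for merging with an adjacent category rather than for a still longer window.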

Shadow-mode cost is real. Running every production input through both the baseline and the candidate doubles your inference spend during the shadow window. Mitigate by sampling — shadow 10% of traffic instead of 100% — and by setting an explicit budget on the shadow window (a week, a month). Indefinite shadow runs are an antipattern; they accumulate cost without producing additional signal after the first few days of dwell.
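
A sketch of a sampled, budgeted shadow decision. `ShadowConfig`, `ShadowState`, and the dollar cap are hypothetical additions; the source only calls for sampling plus an explicit limit on the window.

```typescript
interface ShadowConfig {
  sampleRate: number; // fraction of traffic to shadow, e.g. 0.1
  endsAt: Date; // hard end of the shadow window
  maxSpendUsd: number; // assumed spend cap for the whole window
}

interface ShadowState {
  spentUsd: number; // candidate inference spend recorded so far
}

// rand is injectable so the decision can be tested deterministically.
function shouldShadow(
  cfg: ShadowConfig,
  state: ShadowState,
  now: Date,
  rand: () => number = Math.random,
): boolean {
  if (now.getTime() >= cfg.endsAt.getTime()) return false; // window is over
  if (state.spentUsd >= cfg.maxSpendUsd) return false; // budget exhausted
  return rand() < cfg.sampleRate;
}
```

Because both limits are checked before sampling, a shadow run left enabled stops spending on its own once either limit is reached.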

Canary blast-radius depends on which dimension you sample on. Routing 1% of traffic to a candidate is the typical canary; routing 1% of users — same user always hits the same arm — is what enables per-user A/B analytics and prevents user-visible flip-flopping. Choose by which analysis you’ll actually need; a hash-on-user-id router is the right default for production LLM systems and avoids the “the assistant talks differently every other turn” failure mode that traffic-sampled canaries can produce when a session spans multiple requests.
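
A minimal sketch of a hash-on-user-id router. It uses a 32-bit FNV-1a hash only to stay dependency-free; a production router would more likely use a stronger hash such as SHA-256. The `assignArm` name and the salt are hypothetical.

```typescript
type Arm = "baseline" | "candidate";

// 32-bit FNV-1a over UTF-16 code units. Chosen for brevity, not hash quality.
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force unsigned 32-bit
}

// Same user + same salt always maps to the same arm, so a session never
// flips between models. Changing the salt reshuffles the assignment.
function assignArm(
  userId: string,
  candidateFraction: number, // e.g. 0.01 for a 1% canary
  salt = "canary-v1",
): Arm {
  const bucket = fnv1a32(`${salt}:${userId}`) / 0x100000000; // in [0, 1)
  return bucket < candidateFraction ? "candidate" : "baseline";
}
```

Salting the hash per experiment keeps the same users from landing in every canary, and ramping from 1% to 5% with an unchanged salt keeps the original canary users on the candidate arm, since their buckets are already below the new threshold.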

Drift doesn’t mean rollback by default. Sometimes the drift signal is real and the right action is to update the reference — the user mix legitimately shifted and the system is handling it fine. The decision rule: if the judged quality on the new slice meets or exceeds the reference’s quality threshold, the new distribution is the new normal and you refresh the reference. If the judged quality regresses, the drift caused harm and the upstream change needs to be rolled back. The detector flags the drift; the judge decides the verdict.

Production traffic doesn’t replay deterministically. Shadow-replaying yesterday’s traces through a candidate model gives you the candidate’s outputs against yesterday’s inputs, but the judge’s score on those outputs may not match the judge’s score the candidate would have received in live serving — the judge sees the input/output pair, not the live retrieval state, the cache state, or the conversation context. Replay is a useful approximation, not a perfect simulation; treat it as the cheap pre-canary check, not the canary substitute.

What to read next

Human-in-the-Loop Feedback Loops for LLM Systems — the annotation loop that follows the drift alert. When the detector surfaces a degraded slice, the annotation queue is what tells you why and turns the answer into a structural fix; the drift signal is the priority signal, and the human label is the verdict.
Cost Optimization and Model Routing — the production system that drift detection protects. Routers are themselves models that drift against provider model upgrades; the drift detector is the alarm that says “your router’s miscalibrated on the new tier.”
LLM-as-Judge: Pointwise and Pairwise — the pinned-judge discipline that makes drift signals comparable across time and the both-orderings protocol that survives the noise floor. Judge calibration is a prerequisite for trustworthy drift detection.
Memory Conflict, Forgetting, and Embedding Drift — the deeper treatment of embedding-model drift on the retrieval side, including the dual-index migration pattern and Drift-Adapter affine maps. The monitoring-layer drift in this article surfaces the symptom; that article handles the cause when the embedding model itself changes.