LLM-as-Judge: Pointwise and Pairwise

How LLM-as-judge works in production: rubrics, pointwise vs pairwise, position/verbosity/self-preference bias, and how to calibrate against humans.

Jatin Bansal@blog:~/ai-engineering$ open llm-as-judge

A team ships a rubric-based judge that grades helpfulness 1–5. The first week, every PR’s mean score lands between 3.6 and 3.9 — the suite looks healthy, the gate is green, the model upgrade ships. Two weeks later support escalates a wave of complaints about the assistant being curt and unhelpful, and a closer look at the traces shows the upgraded model emits half the tokens for the same questions. The judge had been counting tokens as effort and effort as quality; the verbosity bias was scoring length, not helpfulness, and the new model lost that bias dividend. Nobody on the team had calibrated the judge against a human reviewer in three months. The judge had drifted into vanity, and the eval gate was meaningless before anyone noticed.

Opening bridge

Yesterday’s piece on eval-driven development sketched the three-tier pyramid — deterministic checks, code-based assertions, LLM-judged checks — and stopped short of the top tier. Today’s article opens it up. The LLM-judged layer is the one that handles open-ended quality (faithfulness, helpfulness, tone, instruction-following) where no regex or schema applies, and it is the layer whose calibration drifts in ways the cheap layers don’t. Every memory framework comparison in the memory evaluation article — Mem0 at 94.4% on LongMemEval, Zep at 71.2% — sat on top of an LLM judge; every faithfulness number in the RAG evaluation piece was a judge call. This piece is the load-bearing internals: how the judge actually decides, where it goes wrong, and the discipline that keeps it useful.

Definition

LLM-as-judge is the practice of using one language model to score, rank, or critique the output of another language model, against a rubric, with results aggregated across a fixed eval set. Three properties separate it from “ask GPT what it thinks.” First, the rubric is explicit and versioned: the judge prompt is the metric, and changing it invalidates historical comparisons the way changing a database schema invalidates a query. Second, the judge is calibrated against a human reference: judge–human agreement on a held-out set is the only number that says the judge is measuring the right thing. Third, the bias surface is treated, not ignored: position, verbosity, self-preference, and authority/format biases appear in every untreated pipeline, and the production protocol bakes in mitigations rather than handwaving them.

The technique was canonized by Zheng et al.’s “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”, which established that a strong judge model reaches ~85% agreement with human experts on MT-Bench using pairwise comparison — at parity with the ~81% human–human agreement on the same task. The paper also catalogued the bias modes that the rest of the field has been mitigating since. The complementary lineage is G-Eval, which formalized chain-of-thought rubric grading with token-probability-weighted continuous scores and is the template most modern pointwise harnesses descend from.

Intuition

The mental model that pays off: the judge is a statistical instrument with a measurement error and a bias profile, not a ground truth. You treat it the way you’d treat a thermocouple in an industrial process — calibrate it against a reference, characterize its drift, account for its noise floor, and never compare readings across instruments without a recalibration window. Two judges scoring “faithfulness 0.78” are not measuring the same thing; two runs of the same judge on the same input two weeks apart may not be measuring the same thing either, if the judge model was upgraded silently between runs.

The right unit of analysis is the delta between systems on the same fixed judge, not the absolute number. A system that improves from 0.74 to 0.81 under a pinned judge is a real signal. A system that scores 0.81 under Claude Sonnet and 0.86 under GPT — those are not comparable; the second judge may simply be more lenient or favor its own family’s outputs. The discipline is to fix the judge for the duration of an A/B and recompute baselines when the judge changes, the same way you recompute baselines when you change a measurement sensor.

The flipside of the instrument framing: a judge is the right tool when the property being measured is open-ended, agreed by humans to be ambiguous, and not captured by cheaper deterministic checks. The cost ladder still applies — every property that admits a regex, a schema, or a string match belongs at the deterministic layer; the judge is reserved for what only humans could otherwise score.

The distributed-systems parallel

The cleanest analogue is the canary–baseline–holdout protocol from progressive delivery. A canary deploy at 1% measures relative to the baseline 99% under identical traffic; the baseline is the calibration reference, and the canary’s metric only has meaning against that baseline. Pairwise LLM judging is the same shape: the judge sees output A and output B from the same input and emits “A wins” or “B wins”; the comparison cancels out a lot of common-mode noise (rubric interpretation, judge sampling variance, query ambiguity) because both sides are exposed to the same conditions. Pointwise scoring is closer to monitoring an absolute SLI — you measure each request against a threshold, in isolation, and the absolute value matters because the threshold matters. Both shapes have a place; the trade-off is the same one canary deploys make.

The deeper parallel is closer to measurement uncertainty in physical instrumentation. Every reading carries a confidence interval; pretending the point estimate is the truth is how teams ship regressions disguised as improvements. The judge’s noise floor — measured by running the same input through the judge five times and computing the score spread — is the floor below which differences are uninterpretable. A 0.85 vs 0.86 win is meaningless if the judge’s per-sample standard deviation is 0.04; a 0.85 vs 0.91 win is real even with the same noise floor. Production teams bootstrap confidence intervals over the eval set and gate on the lower bound of the better system exceeding the upper bound of the worse, not on point estimates.
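The gating rule in the last sentence can be sketched in a few lines of Python. This is a minimal illustration, assuming per-row judge scores are already collected as plain lists of floats; the function names, the percentile-bootstrap choice, and the 2,000-resample default are assumptions made for the example, not a prescribed protocol.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean judge score."""
    rng = random.Random(seed)  # fixed seed so reruns of the gate are reproducible
    n = len(scores)
    # Resample the eval set with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def clear_win(candidate_scores, baseline_scores, **kwargs):
    """True only if the candidate's lower bound beats the baseline's upper bound."""
    candidate_lower, _ = bootstrap_ci(candidate_scores, **kwargs)
    _, baseline_upper = bootstrap_ci(baseline_scores, **kwargs)
    return candidate_lower > baseline_upper
```

Requiring non-overlapping intervals is deliberately conservative: it will reject some real but small improvements, which is usually the right failure direction for a merge gate.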

A real disanalogy worth flagging: physical instruments have stationary error models. LLM judges don’t. A judge based on Claude Sonnet 4.6 has a different bias profile from one based on GPT-5.5 and a different one again after the next minor version upgrade. The calibration window is not optional and is not done once; it is part of the recurring cost of the eval suite.

Mechanics: pointwise scoring

The pointwise judge takes a single input/output pair and emits a score against a rubric. The implementation is mechanically simple — a prompt that contains the rubric, the input, the candidate output, and an instruction to emit a verdict in a structured form. The hard parts are the rubric and the calibration. The rubric collapses if it tries to cover too much at once; “rate the helpfulness of this response on a 1–5 scale” gives you noise because helpfulness dissolves into half a dozen sub-properties that the judge weights inconsistently across calls.

The discipline the field has converged on, traced back to Hamel Husain’s work and the G-Eval paper, looks like this:

  1. Criterion-separated. One judge prompt per property — faithfulness, relevance, instruction-following, tone — not one prompt that scores all four. Each prompt’s rubric is two or three sentences naming the property and the failure modes.
  2. Binary or low-cardinality. Pass/fail or 1–3 beats 1–5, which beats 1–10. The reason is calibration: a binary judge is the easiest to align with a human reviewer because the confusion matrix is only 2×2, so every disagreement is either a false pass or a false fail. Hamel argues, from experience across 30+ companies, that domain-expert pass/fail correlates better with actual quality than granular numeric scores, and pass/fail is harder to game with verbosity. The flip side is that pass/fail loses ordinal information; if you need to see “approaching threshold” trends, a 1–3 scale with anchored examples per level is a defensible middle ground.
  3. Anchored examples. The rubric includes one or two example outputs at each score level, drawn from your own traces. This is the equivalent of rater training in human evaluation: the judge needs to see what a “3” looks like in your domain, not in the abstract.
  4. Chain-of-thought first, verdict last. The judge writes its reasoning before its verdict. G-Eval’s central trick is to ask the judge to enumerate the evaluation steps in CoT form before emitting the score; the score-after-reasoning ordering improves human correlation significantly compared to score-then-justify. Modern pointwise harnesses adopt this universally.
  5. Token-probability scoring (optional). When the judge emits a single token score (e.g. “3”), the probabilities over the candidate score tokens can be used as weights to produce a continuous score: score = Σ p(k) · k for k in the score range. G-Eval shows this reduces score quantization noise and improves correlation with human ratings. This requires log-probability access, which the OpenAI Chat Completions API exposes via its logprobs and top_logprobs parameters; not every provider does, so check the judge provider’s API reference before designing around it. It is most useful for trend-tracking, less so for hard gates.
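The weighted score in step 5 is simple to compute once the log-probabilities are in hand. The sketch below assumes you have already pulled a token-to-logprob mapping for the score position out of the API response; the function name, the input shape, and the renormalization over score tokens only are assumptions for illustration.

```python
import math

def expected_score(score_logprobs, valid_scores=(1, 2, 3)):
    """Probability-weighted score: sum of p(k) * k over the valid score tokens.

    score_logprobs maps candidate tokens at the score position to their
    log-probabilities (hypothetical shape, e.g. built from top_logprobs).
    """
    mass = {k: 0.0 for k in valid_scores}
    for token, logprob in score_logprobs.items():
        text = token.strip()  # tokenizers often emit " 3" with a leading space
        if text.isdigit() and int(text) in mass:
            mass[int(text)] += math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError("no valid score token among the candidates")
    # Renormalize so probability spent on non-score tokens is ignored.
    return sum(k * p for k, p in mass.items()) / total
```

A judge that puts 0.1, 0.3, and 0.6 on the tokens 1, 2, and 3 yields 2.5 rather than a hard 3, which is the quantization noise the continuous score removes.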

The pointwise judge’s failure mode is scale drift: the same property scored across two judge upgrades is not on the same scale, and historical comparisons silently lie unless you pin the judge or recalibrate. The cheap mitigation is to pin the judge model in the eval config (claude-haiku-4-5 is fine for most rubrics) and treat a judge upgrade as a metric reset.
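One way to make "treat a judge upgrade as a metric reset" enforceable is to stamp every eval run with a fingerprint of the judge that produced it. The sketch below is an assumption about how that could look; the PinnedJudge class, the judge_fingerprint key, and the 12-character hash prefix are all hypothetical names and choices.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PinnedJudge:
    model: str           # pinned judge model identifier
    rubric_version: str  # human-readable version tag for the rubric
    rubric_prompt: str   # full judge prompt text; the prompt is the metric

    def fingerprint(self) -> str:
        """Short stable hash of everything that defines the metric's scale."""
        payload = "\n".join([self.model, self.rubric_version, self.rubric_prompt])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

def comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs share a scale only if the same judge fingerprint scored both."""
    return run_a["judge_fingerprint"] == run_b["judge_fingerprint"]
```

A dashboard that refuses to draw a trend line across runs where comparable() is false makes the metric reset visible instead of silent.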

Mechanics: pairwise comparison

The pairwise judge takes two candidate outputs A and B for the same input and emits “A wins,” “B wins,” or “tie.” The protocol comes directly from Chatbot Arena, where pairwise crowd preferences feed Elo ratings, and from MT-Bench, where pairwise judge calls fed the agreement studies above. Pairwise is the right shape when you want to compare two systems against each other — model A vs model B, prompt v1 vs prompt v2 — and when the property is too ambiguous for a stable absolute rubric.

Pairwise’s reliability advantage is real. The judge does not have to map an output onto an absolute scale; it only has to decide which of two outputs is better given the same context. That cancels out a lot of rubric-interpretation noise, which is why pairwise GPT-4 hits 85% human agreement on MT-Bench. But pairwise has two costs: quadratic call count — comparing N systems requires N·(N-1)/2 pairs per query — and position bias, which dominates the noise budget if untreated.
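The call-count arithmetic is worth making concrete before committing to an all-pairs comparison. A small sketch, assuming every pair of systems is compared on every query and that each comparison is rerun with positions swapped to treat position bias; the function name and arguments are illustrative.

```python
from itertools import combinations

def pairwise_calls(systems, n_queries, both_orderings=True):
    """Judge calls needed to compare every pair of systems on every query."""
    pairs = list(combinations(systems, 2))  # N * (N - 1) / 2 unordered pairs
    calls_per_query = len(pairs) * (2 if both_orderings else 1)
    return calls_per_query * n_queries
```

Four systems on a 100-query set come to 6 pairs, 12 calls per query, and 1,200 judge calls per rubric.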

Position bias is the judge’s tendency to prefer the first option (or, depending on the model, the last) regardless of content. Zheng et al. reported Claude-v1 with 70% one-sided preference and GPT-3.5 with 50% positional bias on MT-Bench; recent frontier-model studies show the effect persists, with GPT-4-class judges still changing their verdict when the positions are swapped in roughly 40% of rubric-based pairwise comparisons. The mitigation is non-optional and adds a 2x cost: run each comparison twice with positions swapped, and only count a “win” when both orderings agree. Disagreements are never counted as wins; the simplest version of the protocol records them as ties. This is the canonical “both-orderings” protocol and is what every production pairwise pipeline implements.

The second pairwise gotcha is calibration of partial wins. Two orderings that agree on “A wins” are a clean win for A. Two orderings that flip — A wins in (A, B) and B wins in (B, A) — are not really a tie; they’re a judge whose verdict was determined by position, not content. The instrumented protocol counts these separately as “position-flipped” cases and gates the eval’s reliability on the fraction below ~10%. If you see 30% position-flipped cases, the judge is too noisy for pairwise on this rubric and you either fall back to pointwise or switch judge models.
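Counting the flipped cases separately takes only a tally. The sketch below assumes each comparison has already been reduced to one of four labels ('A', 'B', 'tie', 'position-flipped'); the function name, the report keys, and the strict 10% default threshold are assumptions for illustration.

```python
from collections import Counter

def pairwise_report(outcomes, max_flipped_rate=0.10):
    """Summarize both-orderings outcomes and flag an unreliable judge."""
    if not outcomes:
        raise ValueError("no outcomes to report")
    counts = Counter(outcomes)
    flipped_rate = counts["position-flipped"] / len(outcomes)
    return {
        "a_wins": counts["A"],
        "b_wins": counts["B"],
        "ties": counts["tie"],
        "position_flipped": counts["position-flipped"],
        "flipped_rate": flipped_rate,
        # The eval is trusted only while flips stay below the threshold.
        "reliable": flipped_rate < max_flipped_rate,
    }
```

Keeping ties and position-flipped verdicts in separate counters is the point: a run with many flips and few ties should fail the reliability check even if the win counts look balanced.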

Mechanics: the bias surface

Four biases appear in every untreated judge pipeline, and the discipline is to treat each one with an explicit mitigation rather than trust the judge’s calibration:

  • Position bias. Treated by both-orderings, above.
  • Verbosity bias. The judge prefers longer outputs even when the shorter one is more correct — length reads as effort, effort reads as quality. Eugene Yan’s survey reports Claude-v1 and GPT-3.5 preferring longer responses over 90% of the time on the verbosity probe. The mitigation is a rubric that explicitly rewards conciseness when the shorter response covers the same ground, plus length-normalized scoring on rubrics where length and quality are orthogonal. Don’t trust the judge to detect verbose padding on its own.
  • Self-preference bias. A judge from a model family scores outputs from its own family higher than outputs from other families. Zheng et al. measured GPT-4 with 10% self-enhancement and Claude-v1 with 25% self-preference. The mitigation is cross-family judging — use a judge from a different family than the systems being compared, or use a panel of judges (more below).
  • Authority / format / jargon biases. Catch-all category: the judge prefers outputs that look authoritative (citations, hedged claims), follow a familiar format (bullet lists over prose, even when prose is better), or contain technical jargon. These biases are domain-specific; the mitigation is to detect them via the calibration loop: if the judge and a domain expert disagree systematically on a category of output, that’s a bias signal, and the rubric needs an explicit constraint to counteract it.

These four are not a complete list; the Survey on LLM-as-a-Judge catalogues a dozen more. But the four above are the load-bearing ones in practice — treat them and the rest is small noise on top.
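A cheap probe for verbosity bias is to check how often the longer response wins among decisive pairwise verdicts. This is a rough diagnostic rather than a method from the cited papers: it assumes records of (response_a, response_b, outcome) with outcome labels 'A' or 'B' for decisive verdicts, and it measures length in characters, both of which are choices made for this sketch.

```python
def longer_wins_rate(records):
    """Fraction of decisive verdicts won by the longer response.

    records: iterable of (response_a, response_b, outcome) tuples.
    Returns None when there are no decisive, unequal-length comparisons.
    """
    decisive = 0
    longer_won = 0
    for response_a, response_b, outcome in records:
        # Skip ties, flipped cases, and pairs where length cannot discriminate.
        if outcome not in ("A", "B") or len(response_a) == len(response_b):
            continue
        decisive += 1
        winner, loser = (
            (response_a, response_b) if outcome == "A" else (response_b, response_a)
        )
        if len(winner) > len(loser):
            longer_won += 1
    return longer_won / decisive if decisive else None
```

A high rate is a prompt to look closer, not proof of bias: longer answers are sometimes genuinely better, so the number only becomes a bias signal when the domain expert's verdicts on the same rows do not show the same skew.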

Mechanics: panel of judges (PoLL)

A panel of smaller judges from disjoint model families outperforms a single large judge on most tasks, costs less, and has lower intra-model bias. Verga et al. (“Replacing Judges with Juries”) measured a three-judge panel of smaller models (Command-R, Haiku, GPT-3.5) beating GPT-4 as a single judge on six datasets while costing 7x less. The mechanism is two-fold: cross-family panels neutralize each model’s self-preference, and averaging over independent estimators reduces the per-call variance the same way ensemble methods do everywhere else.

The trade-off is operational: more judge clients to manage, more credentials, more API surface to fail on. Production teams ship panels for the expensive nightly judged sweep where the variance reduction matters; the cheap per-PR layer often runs with a single pinned judge for simplicity. The cleanest panel is three judges from three families — pick one Anthropic, one OpenAI, one open-weights (e.g. a Llama or Qwen variant). Aggregate by majority vote for binary verdicts and by mean for continuous scores.
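The two aggregation rules in the last sentence reduce to a few lines. A minimal sketch, assuming binary verdicts arrive as booleans (True for pass) and continuous scores as floats; the function names and the odd-panel requirement are assumptions for the example.

```python
def majority_vote(verdicts):
    """Binary panel verdict; requires an odd panel so the vote cannot tie."""
    if len(verdicts) % 2 == 0:
        raise ValueError("use an odd number of judges to avoid tied votes")
    return sum(verdicts) > len(verdicts) / 2

def mean_score(scores):
    """Continuous panel score: plain average across judges."""
    if not scores:
        raise ValueError("empty panel")
    return sum(scores) / len(scores)
```

With the three-judge panel described above, a tied vote cannot happen; if a judge call fails and the panel drops to two, raising is safer than silently picking a side.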

Mechanics: calibration against humans

The judge has no meaning until you’ve measured its agreement with a domain expert on a held-out calibration set. The protocol is:

  1. Sample 100–200 traces stratified by category from production or from the existing eval set.
  2. Have one domain expert — not three, not a committee — score every sample on the rubric, using the same prompt the judge will see. The single-expert constraint is from Hamel’s work: averaging over multiple humans introduces variance that swamps the signal at small sample sizes.
  3. Score the same samples with the candidate judge.
  4. Compute Cohen’s kappa between expert and judge for categorical verdicts (or Spearman/Kendall correlation for continuous scores). Plain percentage agreement overstates: an 80% percentage agreement can correspond to a 0.62 kappa, which is fair but not strong (Eugene Yan reports exactly this gap on TriviaQA with Llama-3-8b as judge).
  5. Iterate on the rubric until kappa exceeds ~0.7 on the calibration set, or — if the underlying property is genuinely ambiguous — accept that the judge has a floor and switch to a different property.

Re-run the calibration on a fresh sample monthly, or after any judge-model upgrade, prompt change, or rubric edit. The judge’s agreement number is the metric that says the metric is trustworthy.
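Step 4 is easy to get wrong by reporting raw agreement, so it helps to see the kappa arithmetic written out. The sketch below is a plain-Python Cohen's kappa for two raters over categorical labels; the function name and the worked numbers in the note that follows are illustrative, not data from the article.

```python
from collections import Counter

def cohens_kappa(expert, judge):
    """Cohen's kappa between two equal-length lists of categorical labels."""
    if len(expert) != len(judge) or not expert:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(expert)
    # Observed agreement: fraction of rows where both raters gave the same label.
    observed = sum(e == j for e, j in zip(expert, judge)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expert_counts, judge_counts = Counter(expert), Counter(judge)
    expected = sum(expert_counts[c] * judge_counts[c] for c in expert_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

As a worked case, take 100 samples where both raters pass 70, both fail 10, and they split on the remaining 20 evenly. Raw agreement is 80%, but each rater passes 80% of rows, so chance agreement is 0.68 and kappa is (0.80 − 0.68) / (1 − 0.68) = 0.375. The more skewed the pass rate, the more raw agreement flatters the judge.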

Code: a pointwise judge with both-orderings adapter

The Python harness below pins the judge model, runs the rubric in chain-of-thought mode, parses a structured verdict via Pydantic, and exposes a pairwise shim that runs the comparison prompt twice with the positions swapped. Install: pip install anthropic pydantic.

python
# pip install anthropic pydantic
from anthropic import Anthropic
from pydantic import BaseModel, Field
import json

JUDGE_MODEL = "claude-haiku-4-5"  # pinned; treat upgrade as metric reset
client = Anthropic()

class Verdict(BaseModel):
    reasoning: str = Field(..., description="Step-by-step rubric reasoning before the score.")
    score: int = Field(..., ge=1, le=3, description="1=fail, 2=partial, 3=pass")

# The template is filled with str.format(), so the literal braces in the JSON
# example are doubled ({{ and }}) to keep them from being read as placeholders.
POINTWISE_PROMPT = """\
You are an evaluator. Score the response against the rubric below.

Rubric (faithfulness): the response makes only claims that are explicitly supported by the provided context. Score:
- 3 (pass): every claim supported, no extrapolation
- 2 (partial): one claim under-supported but the rest are clean
- 1 (fail): hallucinated or contradicted claims

Anchored examples:
- A 3 looks like: <one trace from your domain, redacted>
- A 1 looks like: <one trace, redacted>

Reason step by step first, then emit a JSON object: {{"reasoning": "...", "score": N}}.

Context:
{context}

Response:
{response}
"""

def pointwise_judge(context: str, response: str) -> Verdict:
    msg = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": POINTWISE_PROMPT.format(context=context, response=response),
        }],
    )
    text = msg.content[0].text
    # Robust JSON extraction: find the last {...} block in the output.
    start = text.rfind("{")
    end = text.rfind("}")
    return Verdict.model_validate_json(text[start:end + 1])

# Pairwise with the both-orderings adapter: the same comparison prompt is run
# twice with the candidate positions swapped.
def pairwise_judge(context: str, response_a: str, response_b: str) -> str:
    """Returns 'A', 'B', 'tie', or 'position-flipped'."""
    a_first = _compare(context, response_a, response_b)  # A shown as Response 1
    b_first = _compare(context, response_b, response_a)  # B shown as Response 1
    if a_first == "first" and b_first == "second":
        return "A"  # A preferred in both orderings
    if a_first == "second" and b_first == "first":
        return "B"  # B preferred in both orderings
    if a_first == b_first and a_first != "tie":
        return "position-flipped"  # same slot won both times: position, not content
    return "tie"  # both calls said tie, or one said tie and the other picked a side

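# Filled with str.format(): literal JSON braces are doubled, and the rubric is
# restated because the judge sees only this prompt.
PAIRWISE_PROMPT = """\
You are an evaluator. Two responses to the same query are below. Decide which
better follows the rubric (faithfulness: the response makes only claims that are
explicitly supported by the provided context), or whether they're equivalent.

Reason step by step first, then emit a JSON object: {{"reasoning": "...", "verdict": "first"|"second"|"tie"}}.

Context:
{context}

Response 1:
{first}

Response 2:
{second}
"""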

def _compare(context: str, first: str, second: str) -> str:
    msg = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": PAIRWISE_PROMPT.format(context=context, first=first, second=second),
        }],
    )
    text = msg.content[0].text
    start = text.rfind("{")
    end = text.rfind("}")
    return json.loads(text[start:end + 1])["verdict"]

Three things to flag. First, the pointwise rubric is criterion-separated — one property per prompt; running four properties means four calls per row. Second, the pairwise shim explicitly distinguishes “tie” from “position-flipped”; the latter is a judge-failure mode that should be surfaced in the dashboard, not buried in the tie count. Third, the JSON parsing is intentionally permissive — judges sometimes prepend prose to their JSON, and the cheapest fix is to grab the trailing brace-delimited block rather than fight the structured-output configuration. (When you do want to fight it, the structured output piece and constrained decoding piece cover the API and the grammar-based alternatives.)

Code: panel-of-judges with TypeScript and the Vercel AI SDK

For the cross-family panel, the Vercel AI SDK is the cleanest abstraction because it talks to Anthropic, OpenAI, and open-weights providers behind the same generateObject interface. Install: npm install ai @ai-sdk/anthropic @ai-sdk/openai zod.

typescript
// npm install ai @ai-sdk/anthropic @ai-sdk/openai zod
import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const VerdictSchema = z.object({
  reasoning: z.string(),
  score: z.number().int().min(1).max(3),
});

const RUBRIC = `\
Score the response against the rubric (faithfulness: the response makes only
claims supported by the provided context). 3=pass, 2=partial, 1=fail. Reason
step by step before the score.`;

interface JudgeConfig {
  name: string;
  model: ReturnType<typeof anthropic> | ReturnType<typeof openai>;
}

const PANEL: JudgeConfig[] = [
  { name: "haiku", model: anthropic("claude-haiku-4-5") },
  { name: "gpt-mini", model: openai("gpt-5.5-mini") },
  // Add a third cross-family judge (e.g. an open-weights model via a compatible provider)
];

async function panelJudge(
  context: string,
  response: string,
): Promise<{ score: number; per_judge: Record<string, number> }> {
  const calls = PANEL.map(async (j) => {
    const { object } = await generateObject({
      model: j.model,
      schema: VerdictSchema,
      prompt: `${RUBRIC}\n\nContext:\n${context}\n\nResponse:\n${response}`,
    });
    return [j.name, object.score] as const;
  });
  const results = await Promise.all(calls);
  const per_judge = Object.fromEntries(results);
  // Aggregate: median for ordinal scores; mean would also be defensible.
  // With an even-sized panel this index picks the upper of the two middle scores.
  const sorted = [...Object.values(per_judge)].sort((a, b) => a - b);
  const score = sorted[Math.floor(sorted.length / 2)];
  return { score, per_judge };
}

// Disagreement is itself a signal. Surface it.
async function panelJudgeWithDisagreement(context: string, response: string) {
  const { score, per_judge } = await panelJudge(context, response);
  const scores = Object.values(per_judge);
  const max = Math.max(...scores);
  const min = Math.min(...scores);
  return { score, per_judge, spread: max - min };
}

The Vercel SDK’s generateObject handles the schema-coercion plumbing across providers and is the cheapest path to a multi-provider panel. The aggregation here uses median (robust to single-judge outliers); mean is fine when scores are continuous and well-calibrated. The spread metric is the cheap signal: when the panel disagrees by 2 points on a 3-point rubric, that row needs human review — the disagreement is more informative than any individual judge’s score.

Trade-offs, failure modes, gotchas

Choose pointwise for thresholded gating, pairwise for system-vs-system A/Bs. A CI gate (“merge if faithfulness ≥ 0.85”) needs an absolute scale, so pointwise. A model-upgrade decision (“is GPT-5.5 better than Sonnet 4.6 on our workload?”) needs relative comparison without committing to an absolute scale, so pairwise. Mixing the modes confuses what the dashboard means; pick one per use case.

Use a smaller judge when you can. A small model (e.g. Claude Haiku 4.5 at $1/$5 per million input/output tokens, or a comparably-priced GPT-4.1-mini) is the default. The frontier model is needed for the hardest judging tasks, such as long-context faithfulness and deep code-review rubrics, and the cost ratio is significant: a 200-row suite with four rubrics at ten judge calls per row and rubric is 8K calls per run, and the gap between Haiku and Opus is roughly an order of magnitude in cost. Calibrate the small judge first; only escalate to the frontier model when kappa against the human reference is unrecoverably low.

Pin the judge. A judge upgrade (Sonnet 4.5 to Sonnet 4.6, GPT-4 to GPT-4.1, Haiku 3.5 to Haiku 4.5) silently rescales the metric. Treat upgrades as metric resets: dual-run for a week, recompute baselines, document the cutover. The RAG-evaluation article and the memory-evaluation article both make this point for their respective domains; the same warning applies at the application layer.

Don’t let the judge grade its own family. Sonnet judging Sonnet outputs has a measurable self-preference dividend; Sonnet judging GPT outputs and vice versa is cleaner. The panel-of-judges variant fixes this structurally; if you can’t afford the panel, at least pick a judge from a different family than the systems you’re comparing.

Budget the judge cost as a first-class line item. A 500-row golden set with four rubrics, pairwise with both-orderings, is 4,000 judge calls per evaluation run. At Haiku rates with median rubric output length, that is a few dollars; at Opus rates it is tens of dollars. The cost is small per run, but it compounds across nightly sweeps and per-PR gates. The cheap-first ladder from the eval-driven development article applies — gate cheap deterministic checks on every PR, and run the expensive judged layer on a sampled subset or nightly schedule.
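The budget line is a one-function calculation. In the sketch below the per-call token counts (500 input, 150 output) are illustrative assumptions, not measurements; the $1/$5 per million tokens is the Haiku rate quoted earlier, and the function name and signature are hypothetical.

```python
def judge_run_cost(rows, rubrics, calls_per_comparison,
                   input_tokens, output_tokens,
                   usd_per_m_input, usd_per_m_output):
    """Return (judge calls, USD cost) for one evaluation run."""
    calls = rows * rubrics * calls_per_comparison
    usd_per_call = (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return calls, calls * usd_per_call

# 500 rows x 4 rubrics x 2 orderings, with assumed token counts per call.
calls, cost = judge_run_cost(500, 4, 2, 500, 150, 1.0, 5.0)
print(calls, round(cost, 2))  # 4000 calls, 5.0 USD under these assumptions
```

Multiplying that per-run figure by the number of nightly sweeps and per-PR runs in a month is what turns it into the line item the paragraph above asks for.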

Beware the comfortable threshold. A judge that always returns 0.85±0.02 is a judge that does no work. Either the rubric is too easy, the threshold is uncalibrated, or the judge is averaging out the per-category failures the suite is supposed to catch. Slice the score by category every time; the aggregate is for the dashboard, the per-category numbers are for action.
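Slicing is a group-by, and it is worth wiring in from the first run. A minimal sketch, assuming each judged row carries a category label alongside its score; the function name and the category names in the note below are made up for illustration.

```python
from collections import defaultdict

def scores_by_category(rows):
    """Mean judge score per category from (category, score) pairs."""
    buckets = defaultdict(list)
    for category, score in rows:
        buckets[category].append(score)
    return {category: sum(s) / len(s) for category, s in buckets.items()}
```

Seven "billing" rows at 1.0 and three "refunds" rows at 0.5 average to a comfortable 0.85 overall, while the per-category view shows refunds failing half the time.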

Calibrate on a schedule, not once. The corpus drifts; the model under test drifts; the judge drifts on its own as the provider updates the model. A calibration run that passed in February tells you nothing about the judge’s reliability in May. Schedule the human-alignment check as a recurring task, monthly and after any judge, prompt, or rubric change, as in the calibration protocol above.

Don’t outsource the rubric write-up. The rubric is the metric. Letting a vendor (or an LLM) draft the rubric loses the design signal that the error-analysis pass was for. The judge is your expert’s verdict scaled to thousands of rows per run; if the expert never wrote the verdict criteria themselves, the scaling is multiplying confusion.

Position-flipped rate as a judge health signal. Track the fraction of pairwise comparisons that flip under both-orderings as a separate dashboard line. A rate above 15% means the judge is unreliable on this rubric and you should either fall back to pointwise, switch the judge model, or tighten the rubric. Treating the position-flipped rate as a signal — not as ties — is what catches the failure mode where a “balanced” 50/50 result is actually 0% real wins and 100% position-determined coin flips.

Further reading

  • Eval-Driven Development for LLM Systems — the pyramid this article sits inside; the judge is the top tier, and the cost ladder of cheap-deterministic-then-judged is the load-bearing protocol.
  • Production Tracing and Observability for LLM Systems — where judge spans actually land. The judge runs offline, but it scored a specific production turn, and the trace store is the artifact the regression investigation opens first. The cross-pillar discipline that makes per-trace judged-eval results actionable.
  • Drift Detection and Regression Testing for LLM Systems — the steady-state monitor and the model-upgrade protocol that both use the pinned judge from this article as their measurement instrument. Judge calibration drift will look like concept drift unless you treat the two separately.
  • Human-in-the-Loop Feedback Loops for LLM Systems — the human-calibration loop that pins the judge’s verdict to a domain expert’s. Production user feedback, annotation queues, label hygiene, and the data flywheel that keeps the judge measuring the right thing as the corpus drifts.