$ cat ai-engineering/model-routing.md

Cost Optimization and Model Routing

How predictive routers and cascades trade model cost, quality, and tail latency per request.

Jatin Bansal@blog:~/ai-engineering$ open model-routing

Request difficulty varies within most LLM workloads. A router sends routine requests to a cheaper model and reserves an expensive model for requests that need it, provided the routing error rate and added latency stay within the product’s quality target.

The routing contract

Model routing selects a candidate model for each request under a quality, cost, or latency constraint. The router can be a heuristic, classifier, learned preference model, or cascade that evaluates a cheaper model’s response before escalating.

Routers split along two axes that matter operationally. The first axis is when the routing decision happens. Predictive routing makes a single choice upfront based on features of the request (token length, detected language, embedding-space clusters, a learned classifier’s verdict). Cascade routing (also called sequential routing) starts with the cheapest candidate, evaluates the response, and escalates to a more expensive tier only if the cheap response fails a confidence check. The second axis is what the router optimizes: cost minimization at a quality floor (don’t drop accuracy below X, minimize spend), quality maximization at a cost ceiling (don’t spend more than Y per request, maximize accuracy), or latency-sensitive routing (route by tail-latency budget, not cost). Most production routers are predictive, optimize cost at a quality floor, and run in single-digit milliseconds, which is small enough to live in the request path without a meaningful p99 contribution.
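The three optimization targets can be written as three small selection functions over the same candidate pool. This is a minimal sketch, not a production policy: the `Candidate` fields, the fallback behaviour when nothing satisfies the constraint, and the example numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    est_quality: float     # hypothetical predicted quality in [0, 1] for this request
    cost_usd: float        # hypothetical expected cost of serving this request
    p99_latency_ms: float  # hypothetical tail latency of the model


def cheapest_above_floor(cands: list[Candidate], quality_floor: float) -> Candidate:
    """Cost minimization at a quality floor."""
    ok = [c for c in cands if c.est_quality >= quality_floor]
    if ok:
        return min(ok, key=lambda c: c.cost_usd)
    # Assumption: if nothing clears the floor, fall back to the best model available.
    return max(cands, key=lambda c: c.est_quality)


def best_under_ceiling(cands: list[Candidate], cost_ceiling: float) -> Candidate:
    """Quality maximization at a cost ceiling."""
    ok = [c for c in cands if c.cost_usd <= cost_ceiling]
    if ok:
        return max(ok, key=lambda c: c.est_quality)
    # Assumption: if everything is over budget, fall back to the cheapest model.
    return min(cands, key=lambda c: c.cost_usd)


def best_within_latency(cands: list[Candidate], budget_ms: float) -> Candidate:
    """Latency-sensitive routing: best quality that fits the tail-latency budget."""
    ok = [c for c in cands if c.p99_latency_ms <= budget_ms]
    if ok:
        return max(ok, key=lambda c: c.est_quality)
    return min(cands, key=lambda c: c.p99_latency_ms)


POOL = [
    Candidate("small", est_quality=0.70, cost_usd=0.002, p99_latency_ms=400),
    Candidate("large", est_quality=0.95, cost_usd=0.010, p99_latency_ms=1500),
]
```

The same pool gives different answers depending on the objective, which is why the objective should be fixed before any router is trained or tuned.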

Routing operates inside a workload against a fixed model pool. A uniformly difficult workload has little routing headroom; mixed workloads can save money when a cheap tier handles a large share of requests without quality loss.

Router architectures

Predictive routing: classify upfront, commit

The cheapest router shape. A small classifier (logistic regression on hand-engineered features, a fine-tuned small LM, or a matrix-factorization model over the request and the candidate pool) scores each incoming request and assigns it to a tier in one shot. The classifier runs in milliseconds, and the routing decision is final. The training data is (request, which-model-was-good-enough) pairs collected from offline labeling, LLM-as-judge runs, or user feedback signals from a prior shadow-routed system.
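As a rough illustration of the logistic-regression variant, the sketch below scores a request from a few hand-engineered features and commits to a tier in one shot. The feature set, the regexes, and especially the weights in `WEIGHTS` are hypothetical placeholders; in a real system the weights would be fit on the labeled pairs described above.

```python
import math
import re

# Hypothetical weights for illustration only; fit these on labeled
# (request, which-model-was-good-enough) pairs in practice.
WEIGHTS = {"bias": -2.0, "log_tokens": 0.6, "has_code": 1.2, "has_math": 1.0}


def features(prompt: str) -> dict[str, float]:
    """Cheap features computed from the request text alone."""
    n_tokens = len(prompt.split())  # whitespace split as a crude token count
    has_code = "```" in prompt or re.search(r"\b(def|class|import)\s", prompt)
    has_math = re.search(r"\b(prove|integral|derivative)\b", prompt, re.I)
    return {
        "bias": 1.0,
        "log_tokens": math.log1p(n_tokens),
        "has_code": 1.0 if has_code else 0.0,
        "has_math": 1.0 if has_math else 0.0,
    }


def p_needs_strong(prompt: str) -> float:
    """Logistic score: estimated probability that the weak model is not good enough."""
    z = sum(WEIGHTS[name] * value for name, value in features(prompt).items())
    return 1.0 / (1.0 + math.exp(-z))


def route(prompt: str, threshold: float = 0.5) -> str:
    """Single upfront decision; there is no second look at the response."""
    return "strong" if p_needs_strong(prompt) >= threshold else "weak"
```

Under these placeholder weights a short factual question stays on the weak tier, while a prompt that mixes a proof request with code crosses the threshold.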

RouteLLM, the open-source framework released by LMSYS in July 2024 and accepted to ICLR 2025, is the canonical worked example. The paper trains four router architectures on Chatbot Arena preference data: similarity-weighted ranking, a matrix-factorization model that decomposes (query, model) → reward, a BERT classifier, and a causal LM classifier. The headline result, reproduced from the paper: the matrix-factorization router achieves 95% of GPT-4’s quality on MT-Bench while routing only 26% of queries to GPT-4, roughly 48% cheaper than a random-routing baseline at the same quality. With data augmentation from an LLM judge, the same router reaches 95% quality with only 14% strong-model calls, 75% cheaper than the random baseline.
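Figures like "26% of queries to the strong model" imply a threshold on the router's score that is calibrated against a sample of traffic. The sketch below shows one simple way to do that; `calibrate_threshold` is a hypothetical helper written for this illustration, not RouteLLM's API, and it assumes higher scores mean "more likely to need the strong model".

```python
def calibrate_threshold(scores: list[float], strong_fraction: float) -> float:
    """Pick threshold t so that about `strong_fraction` of scores satisfy score >= t.

    `scores` are router outputs on a calibration sample of real requests.
    """
    if not scores:
        raise ValueError("need at least one calibration score")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be in [0, 1]")
    ranked = sorted(scores, reverse=True)
    k = round(strong_fraction * len(ranked))  # how many requests go to the strong model
    if k == 0:
        return float("inf")  # no score can reach this, so nothing is routed to strong
    return ranked[k - 1]


# Usage: ten evenly spaced calibration scores, target 30% strong-model calls.
calibration_scores = [i / 10 for i in range(1, 11)]
threshold = calibrate_threshold(calibration_scores, strong_fraction=0.3)
routed_to_strong = sum(s >= threshold for s in calibration_scores)
```

Ties in the scores can push the realized fraction slightly above the target, so the realized rate is worth re-measuring on held-out traffic.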

Martian and NotDiamond are the commercial heirs of this architecture, with refinements: per-customer custom routers trained on the customer’s own traffic (the NotDiamond router-training quickstart walks through this), routers that account for provider-side outages and latency in addition to quality, and feature engineering that includes embedding distance from training-set clusters. The Martian site claims cost reductions of 20–97% depending on workload; NotDiamond reports a 39% accuracy increase across SRE benchmarks in one published enterprise case study. The honest reading of these numbers is that they’re highly workload-specific; a recent RouterArena benchmark found NotDiamond ranked 12th on LongBench-v2 because its general-purpose router frequently selected expensive models for queries that didn’t need them. Custom routers trained on workload data consistently outperform general-purpose ones; off-the-shelf routers are the right choice when you don’t have labeled traffic yet and the wrong choice once you do.

Cascade routing: start cheap, escalate on low confidence

The shape from the FrugalGPT paper (Chen, Zaharia, Zou, May 2023). Try the cheap model first; check the response against a score function that estimates whether the answer is acceptable; if it passes, return it; if not, escalate to the next tier and repeat. The score function is the hard part: it can be a separate small judge model, the cheap model’s own log-probabilities, a verifier that runs the answer against a known schema, or a simple regex/structured-output check. The original FrugalGPT result: matching GPT-4’s accuracy with up to 98% cost reduction on the workloads they evaluated, by running queries through a cascade built from 12 candidate models with a learned regression-based scoring function.
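The try, score, escalate loop generalizes to any number of tiers once the score function is treated as a pluggable callable. The sketch below is an illustration of that loop, not FrugalGPT's implementation; `run_cascade`, the stub models, and the trivial scorer are hypothetical stand-ins.

```python
from typing import Callable

Model = Callable[[str], str]          # prompt -> response
Scorer = Callable[[str, str], float]  # (prompt, response) -> acceptability in [0, 1]


def run_cascade(
    prompt: str,
    tiers: list[tuple[str, Model, float]],  # (name, model, accept_threshold), cheapest first
    scorer: Scorer,
) -> tuple[str, str, list[str]]:
    """Return (tier that answered, response, every tier that was called)."""
    tried: list[str] = []
    for i, (name, model, threshold) in enumerate(tiers):
        response = model(prompt)
        tried.append(name)
        is_last = i == len(tiers) - 1
        # The last tier is always accepted: there is nowhere left to escalate.
        if is_last or scorer(prompt, response) >= threshold:
            return name, response, tried
    raise ValueError("tiers must be non-empty")


# Stub models and scorer so the sketch runs without any API calls.
def stub_cheap(prompt: str) -> str:
    return "" if "hard" in prompt else "Paris"


def stub_strong(prompt: str) -> str:
    return "detailed answer"


def non_empty_scorer(prompt: str, response: str) -> float:
    return 1.0 if response else 0.0


TIERS = [("cheap", stub_cheap, 0.5), ("strong", stub_strong, 0.5)]
```

The `tried` list matters for accounting: every tier in it was paid for, not only the one whose answer was returned.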

Cascade routing’s structural advantage over predictive routing is that it gets to see the cheap model’s actual response before deciding. The predictive router commits to a tier on the input alone, which means it has to predict response quality without observing it; the cascade gets to score the realized response, which is a much easier learning problem. This is also cascade routing’s structural disadvantage: every escalation pays both the cheap model’s full cost and the expensive model’s full cost. If 40% of queries escalate, the average cost is 0.6·C₁ + 0.4·(C₁ + C₂) instead of the predictive router’s 0.6·C₁ + 0.4·C₂; the cascade pays C₁ on every request, escalations included. Cascades win when the cheap model handles a large majority of traffic and the escalation rate stays under ~30%; they lose when the workload is hard enough that escalations dominate.
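The two expected-cost expressions above translate directly into code. This is a small worked sketch; the function names are made up for illustration, and `cascade_breakeven_rate` is an added helper that only compares a cascade against always calling the strong model on cost, ignoring latency and quality.

```python
def cascade_cost(c_weak: float, c_strong: float, p_escalate: float) -> float:
    """Every request pays the weak model; escalations also pay the strong one."""
    return c_weak + p_escalate * c_strong


def predictive_cost(c_weak: float, c_strong: float, p_strong: float) -> float:
    """Each request pays exactly one model (router overhead not included)."""
    return (1 - p_strong) * c_weak + p_strong * c_strong


def cascade_breakeven_rate(c_weak: float, c_strong: float) -> float:
    """Escalation rate at which the cascade costs the same as always-strong.

    Solves c_weak + p * c_strong = c_strong for p.
    """
    return 1 - c_weak / c_strong


# The 40%-escalation example from the text, with illustrative per-request costs.
C1, C2 = 0.002, 0.010
extra = cascade_cost(C1, C2, 0.4) - predictive_cost(C1, C2, 0.4)  # equals 0.4 * C1
```

At the same escalation rate the cascade's premium over a predictive router is exactly `p_escalate * c_weak`, which is the price paid for seeing the cheap response before deciding.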

Cascade routing also has a latency tail that predictive routing doesn’t: every escalation adds a full extra round-trip to the request path. If the cheap model takes 400ms and the expensive model takes 1.5s, escalations land at ~1.9s of end-to-end latency instead of 1.5s. Production cascades therefore tune the cheap-model timeout aggressively (often well below the cheap model’s p99 latency) and accept some “cheap timeout, escalate anyway” cases to keep tail latency bounded.
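One way to bound that tail is to put a hard timeout on the cheap call and escalate when it fires. The sketch below uses `asyncio.wait_for` with stub coroutines in place of real model clients; the function name, the status labels, and the timeout value are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable

AsyncModel = Callable[[str], Awaitable[str]]


async def cascade_with_timeout(
    prompt: str,
    cheap: AsyncModel,
    expensive: AsyncModel,
    accept: Callable[[str], bool],
    cheap_timeout_s: float,
) -> tuple[str, str]:
    """Return (answer, reason), where reason says which path was taken."""
    try:
        draft = await asyncio.wait_for(cheap(prompt), timeout=cheap_timeout_s)
    except asyncio.TimeoutError:
        # "Cheap timeout, escalate anyway": stop waiting and go to the strong tier.
        return await expensive(prompt), "cheap_timeout"
    if accept(draft):
        return draft, "cheap_accepted"
    return await expensive(prompt), "escalated"


# Stub models so the sketch runs without network calls.
async def slow_cheap(prompt: str) -> str:
    await asyncio.sleep(0.2)
    return "late draft"


async def fast_cheap(prompt: str) -> str:
    return "draft"


async def strong(prompt: str) -> str:
    return "strong answer"
```

With this shape, the worst-case latency of an escalated request is the timeout plus the strong model's latency, rather than the cheap model's own tail plus the strong model's latency.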

Content-based routing: heuristics on features you already have

The shape that doesn’t get its own paper but ships first in most teams because the features are free. Route on detected language (English to GPT-5.4-mini, Mandarin to Qwen-Plus), on content type (code → a code-specialized model, prose → a general-purpose one), on detected complexity proxies (token count, presence of math/code/tables), on user tier (free users get the cheap model, paying users get the expensive one). Heuristic content-based routing is what every team starts with and what most teams continue to use as a layer underneath a learned router; the heuristic does the obvious splits and the learned router does the harder per-request calls inside the remaining bucket.

The honest framing: content-based routing is what gets you 60–80% of the routing win at near-zero implementation cost. The marginal gain from layering a learned router on top is real but typically smaller than the gain from the heuristics alone. Build the heuristic first, measure, and layer learned routing only where the heuristic leaves obvious money on the table.
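A heuristic router of this kind is usually a short, ordered list of rules. The sketch below is one possible arrangement under stated assumptions: the tier labels, the regexes, the 400-token cutoff, and the rule order are all hypothetical and would need tuning per workload.

```python
import re

# Crude content detectors; both patterns are illustrative, not exhaustive.
CODE_HINT = re.compile(r"```|\b(?:def|class)\s+\w+\s*[(:]|\bSELECT\b.+\bFROM\b")
CJK = re.compile(r"[\u4e00-\u9fff]")


def heuristic_route(prompt: str, user_tier: str = "free", long_prompt_tokens: int = 400) -> str:
    """Return a tier label using only features that are free to compute."""
    if CODE_HINT.search(prompt):
        return "code-model"          # content type: code goes to a code-specialized model
    if CJK.search(prompt):
        return "chinese-model"       # detected language
    n_tokens = len(prompt.split())   # whitespace split as a rough complexity proxy
    if user_tier == "paid" or n_tokens > long_prompt_tokens:
        return "expensive-model"     # user tier or long input
    return "cheap-model"
```

A learned router can then be applied only to requests that fall through to the last two branches, which is the layering described above.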

Expected cost and break-even

Concrete numbers as of May 2026, based on the published pricing across providers. The model tiers most production teams route between:

Tier	Provider/Model	Input $/M tokens	Output $/M tokens	Cost per request (500 input / 300 output tokens)
Cheapest	Gemini 2.5 Flash Lite	$0.10	$0.40	~$0.00017
Cheap	Haiku 4.5	$1.00	$5.00	~$0.002
Mid	GPT-5.4-mini	$0.75	$4.50	~$0.0017
Frontier	Sonnet 4.6	$3.00	$15.00	~$0.006
Top	Opus 4.7	$5.00	$25.00	~$0.010
Top	GPT-5.5	$5.00	$30.00	~$0.0115

The cost-savings math for an always-Opus baseline (~$0.010/request) routed to a Haiku-mostly policy:

70% Haiku + 30% Opus + $0.001 router overhead = 0.70·0.002 + 0.30·0.010 + 0.001 = $0.0054/request, a 46% reduction.
90% Haiku + 10% Opus + $0.001 = 0.90·0.002 + 0.10·0.010 + 0.001 = $0.0038/request, a 62% reduction.
100% Haiku (no router, accept quality drop) = $0.002/request, an 80% reduction at unknown quality cost.

The breakeven analysis: a router with per-request overhead r is worth running iff (1-p_strong)·(C_strong - C_weak) > r. Plugging in p_strong = 0.30, C_strong = $0.010, C_weak = $0.002: the savings are 0.70·0.008 = $0.0056/request. Anything less than $5.60 per 1000 requests in router overhead is a net win. At Anthropic Haiku pricing, a typical classifier prompt (~500 input tokens, ~10 output tokens) costs about $0.0006/request, roughly an order of magnitude under the breakeven. The router pays for itself even before you account for the latency improvement on the routed-to-Haiku majority of traffic.
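The blended-cost and break-even arithmetic is easy to get wrong by hand, so a few lines of code are a useful check. This is a minimal sketch; the function names are invented for illustration and the inputs are the per-request figures used in this section.

```python
def request_cost(in_tok: int, out_tok: int, in_per_m: float, out_per_m: float) -> float:
    """Cost in USD of one call, given per-million-token prices."""
    return (in_tok * in_per_m + out_tok * out_per_m) / 1_000_000


def blended_cost(p_strong: float, c_weak: float, c_strong: float, router_overhead: float = 0.0) -> float:
    """Average per-request cost when a fraction p_strong goes to the strong model."""
    return (1 - p_strong) * c_weak + p_strong * c_strong + router_overhead


def max_router_overhead(p_strong: float, c_weak: float, c_strong: float) -> float:
    """Savings vs. always-strong before overhead; the router must cost less than this."""
    return (1 - p_strong) * (c_strong - c_weak)


C_WEAK = request_cost(500, 300, 1.0, 5.0)    # Haiku-tier prices from the table
C_STRONG = request_cost(500, 300, 5.0, 25.0) # Opus-tier prices from the table
CLASSIFIER = request_cost(500, 10, 1.0, 5.0) # ~500-in / ~10-out classifier call

for p in (0.30, 0.10):
    cost = blended_cost(p, C_WEAK, C_STRONG, router_overhead=0.001)
    print(f"p_strong={p:.2f} cost=${cost:.4f} reduction={1 - cost / C_STRONG:.0%}")
```

Comparing `CLASSIFIER` with `max_router_overhead(0.30, C_WEAK, C_STRONG)` reproduces the break-even check: the router is worth running whenever the first number is smaller than the second.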

Two structural numbers are worth committing to memory. The 5× rule of Anthropic pricing: output costs 5× input across every tier, so any optimization that reduces output tokens (prompts that ask for concise answers, schema-constrained generation via structured output, early stopping) saves 5× as much per token as the same reduction on input tokens. The 50% batch-API discount: the Anthropic Batch API and OpenAI’s batch endpoint both charge 50% of standard rates for asynchronous, ≤24-hour-latency jobs. Any workload that can tolerate that latency (overnight evals, content generation queues, reflection/consolidation passes, sleep-time compute) runs at half the list price, and the discount compounds with prompt caching for a 0.5 × 0.1 = 5% effective rate on cached input tokens in batch jobs.
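A short calculation makes the stacking concrete. The sketch below assumes, as the text does, a 0.1× multiplier on cached input tokens and a 0.5× batch multiplier applied to the whole job; `job_cost_usd` and its parameter names are hypothetical, and the multipliers should be checked against the provider's current price card.

```python
def job_cost_usd(
    cached_in: int,
    fresh_in: int,
    out: int,
    in_per_m: float,
    out_per_m: float,
    batch: bool = False,
    cache_read_multiplier: float = 0.1,  # assumption taken from the text
    batch_multiplier: float = 0.5,       # assumption taken from the text
) -> float:
    """Cost of one job with cached input, uncached input, and output tokens."""
    cost = (
        cached_in * in_per_m * cache_read_multiplier
        + fresh_in * in_per_m
        + out * out_per_m
    ) / 1_000_000
    return cost * batch_multiplier if batch else cost


# One million cached input tokens at a $3.00/M list price, run as a batch job.
cached_batch = job_cost_usd(1_000_000, 0, 0, in_per_m=3.0, out_per_m=15.0, batch=True)
effective_rate = cached_batch / 3.0  # fraction of list price actually paid
```

Only the cached input tokens reach the 5% rate; uncached input and all output tokens in the same batch job are billed at 50%.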

Code: a hand-rolled cascade router in Python

The skeleton below implements a two-tier cascade against the Anthropic SDK: try Haiku first, escalate to Sonnet if a self-reported confidence score from Haiku falls below a threshold. The confidence signal here is a simple structured-output score from the model itself; production systems would use a separate judge or learned regressor, but the self-report works as a starting baseline and the math doesn’t change.

python

# pip install anthropic pydantic
import os
import time
from dataclasses import dataclass

import anthropic
from pydantic import BaseModel, Field

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Per-million pricing as of May 2026.
PRICE_PER_M = {
    "claude-haiku-4-5":  {"input": 1.0, "output":  5.0},
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},
    "claude-opus-4-7":   {"input": 5.0, "output": 25.0},
}


class CascadeResponse(BaseModel):
    answer: str = Field(description="The actual answer.")
    self_confidence: float = Field(
        ge=0, le=1, description="How confident the model is in the answer."
    )


@dataclass
class RouteTrace:
    model_used: str
    escalated: bool
    cost_usd: float
    latency_ms: float
    input_tokens: int
    output_tokens: int


def call(model: str, prompt: str) -> tuple[CascadeResponse, dict]:
    """One call. Returns parsed response + usage."""
    t0 = time.perf_counter()
    msg = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
        system=(
            "Respond with a JSON object: "
            '{"answer": <string>, "self_confidence": <float in [0,1] '
            "indicating how confident you are this answer is correct>}. "
            "Be conservative; if the question is ambiguous, lower the score."
        ),
    )
    latency_ms = (time.perf_counter() - t0) * 1000
    text = "".join(b.text for b in msg.content if b.type == "text")
    # In production, use the structured-output article's robust JSON parsing.
    parsed = CascadeResponse.model_validate_json(text)
    return parsed, {
        "input_tokens": msg.usage.input_tokens,
        "output_tokens": msg.usage.output_tokens,
        "latency_ms": latency_ms,
    }


def cost_for(model: str, in_tok: int, out_tok: int) -> float:
    p = PRICE_PER_M[model]
    return (in_tok * p["input"] + out_tok * p["output"]) / 1_000_000


def cascade(
    prompt: str,
    cheap: str = "claude-haiku-4-5",
    expensive: str = "claude-sonnet-4-6",
    confidence_floor: float = 0.7,
) -> tuple[CascadeResponse, RouteTrace]:
    cheap_resp, cheap_usage = call(cheap, prompt)
    cheap_cost = cost_for(cheap, cheap_usage["input_tokens"], cheap_usage["output_tokens"])

    if cheap_resp.self_confidence >= confidence_floor:
        return cheap_resp, RouteTrace(
            model_used=cheap, escalated=False, cost_usd=cheap_cost,
            latency_ms=cheap_usage["latency_ms"],
            input_tokens=cheap_usage["input_tokens"],
            output_tokens=cheap_usage["output_tokens"],
        )

    # Escalate. We pay the cheap call's cost *and* the expensive call.
    exp_resp, exp_usage = call(expensive, prompt)
    exp_cost = cost_for(expensive, exp_usage["input_tokens"], exp_usage["output_tokens"])
    return exp_resp, RouteTrace(
        model_used=expensive, escalated=True,
        cost_usd=cheap_cost + exp_cost,
        latency_ms=cheap_usage["latency_ms"] + exp_usage["latency_ms"],
        input_tokens=cheap_usage["input_tokens"] + exp_usage["input_tokens"],
        output_tokens=cheap_usage["output_tokens"] + exp_usage["output_tokens"],
    )


if __name__ == "__main__":
    # Realistic spread: easy/medium/hard.
    prompts = [
        "What's the capital of France?",  # easy → cheap should be confident
        "Summarize this sentence: 'The cat sat on the mat.'",  # easy
        "Walk me through the trade-offs of using a B-tree vs. an LSM-tree for "
        "a write-heavy workload, and quantify the read amplification.",  # hard
        "Prove that the sum of two even numbers is even.",  # easy
    ]
    total_cost = 0.0
    escalations = 0
    for p in prompts:
        resp, trace = cascade(p)
        print(f"[{trace.model_used:25s}] escalated={trace.escalated} "
              f"cost=${trace.cost_usd:.5f} latency={trace.latency_ms:.0f}ms "
              f"answer={resp.answer[:60]}...")
        total_cost += trace.cost_usd
        if trace.escalated:
            escalations += 1
    print(f"\nTotal: ${total_cost:.4f} over {len(prompts)} requests, "
          f"escalations={escalations}/{len(prompts)}")

The self-confidence score is the simplest possible cascade signal but also the weakest. Production cascades typically use either a separate small judge (one LLM-as-judge call on the cheap model’s output) or a learned regressor that takes the cheap response’s log-probs and outputs an accept/escalate score. The self-report version above is fine for getting a cascade running but should be replaced with a learned scorer once you have ~1k labeled (prompt, cheap_response, was_good_enough) tuples. The cascade pays the cheap call’s cost on every request and the expensive call’s cost on escalations, so the savings depend on the escalation rate staying low (under ~30% as a rule of thumb). Tune confidence_floor against your workload.
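Tuning confidence_floor amounts to sweeping candidate values over labeled samples and watching two numbers move against each other: the escalation rate and the share of requests where a bad cheap answer was accepted. The sketch below assumes each sample is a (self_confidence, cheap_was_good_enough) pair; `sweep_floor` and the toy data are hypothetical.

```python
def sweep_floor(
    samples: list[tuple[float, bool]],  # (self_confidence, cheap_was_good_enough)
    floors: list[float],
) -> list[tuple[float, float, float]]:
    """For each floor, return (floor, escalation_rate, bad_accept_rate)."""
    if not samples:
        raise ValueError("need at least one labeled sample")
    n = len(samples)
    rows = []
    for floor in floors:
        # Same acceptance rule as the cascade: keep the cheap answer if conf >= floor.
        accepted = [good for conf, good in samples if conf >= floor]
        escalation_rate = 1 - len(accepted) / n
        bad_accept_rate = sum(not good for good in accepted) / n
        rows.append((floor, escalation_rate, bad_accept_rate))
    return rows


# Toy labeled data for illustration only.
labeled = [(0.9, True), (0.8, True), (0.75, False), (0.5, False), (0.4, True)]
for floor, esc, bad in sweep_floor(labeled, [0.7, 0.85]):
    print(f"floor={floor:.2f} escalation_rate={esc:.0%} bad_accept_rate={bad:.0%}")
```

Raising the floor lowers the bad-accept rate at the price of more escalations, so the right value is whichever keeps quality above the product's target at the lowest escalation rate.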