Cost Optimization and Model Routing
Tiered model routing, cascades, and learned routers — RouteLLM, Martian, NotDiamond, OpenRouter, LiteLLM — plus the cost math that tells you when to route.
A consumer-facing chat product is spending $180k/month on Anthropic Sonnet 4.6 and the CFO wants 40% out of the line. The team’s first instinct is to migrate everything to Haiku 4.5 — at $1/$5 per million input/output it’s a third the price — but a week-long eval shows Haiku drops accuracy on the hard 15% of traffic by enough that complaint volume doubles. The win the team eventually ships isn’t a model migration. It’s a router: a small classifier in front of every request that sends 70% of traffic to Haiku and the remaining 30% — the queries the classifier flags as complex — to Sonnet. Spend drops 52%, complaints drop 4%, p99 latency goes up by ~80ms (the router’s own inference time). This is the cost-optimization problem the article exists to make legible: once your model is fixed, the cheapest single lever for cutting the bill is not sending every request to the same tier.
Opening bridge
The last two articles in the Production & Operations subtree — inference latency and speculative decoding — both attacked the cost-and-latency problem on the server side: continuous batching, prefill/decode disaggregation, draft-and-verify decoding. Today’s piece moves the lever upstream, into the application layer. Once the inference engine is doing the cheapest possible thing per call, the next dollar of savings comes from fewer expensive calls in the first place. The most direct way to get there — short of fine-tuning a smaller model to do the flagship’s job (which the next article in this subtree covers), or distilling a frontier teacher into a deployable student (which the curriculum’s training-side closer covers) — is to route per-request. Same target quality, lower average cost; same dollar budget, higher headroom for the hard cases. Routing is also the operational story behind prompt caching’s cost math: the savings you measure from caching only land if the cached prefix flows through a model that justifies the cache write, which is itself a routing decision.
Definition
Model routing is the process of selecting, per request, the cheapest model from a candidate pool that meets the quality requirement for that specific request. The pool is typically a tier of models from the same provider (Haiku → Sonnet → Opus, GPT-5-nano → GPT-5.4-mini → GPT-5.5) or a mix across providers (Haiku for chat, GPT-5.4-mini for code, Gemini 2.5 Flash for long-context summarization). The selection function is the router: a piece of logic — heuristic, classifier, or learned model — that maps each incoming request to a candidate model.
Routers split along two axes that matter operationally. The first axis is when the routing decision happens: predictive routing makes a single choice upfront based on features of the request (token length, detected language, embedding-space clusters, a learned classifier’s verdict); cascade routing (also called sequential routing) starts with the cheapest candidate, evaluates the response, and escalates to a more expensive tier only if the cheap response fails a confidence check. The second axis is what the router optimizes: pure cost minimization at a quality floor (don’t drop accuracy below X, minimize spend), quality maximization at a cost ceiling (don’t spend more than Y per request, maximize accuracy), or latency-sensitive routing (route by tail-latency budget, not cost). Most production routers are predictive, optimize cost at a quality floor, and run on the order of single-digit milliseconds — small enough to live in the request path without showing up as a meaningful p99 contribution.
The framing it’s worth keeping straight: routing is not model selection at deployment time. Picking GPT-5.5 vs. Claude Opus 4.7 vs. Gemini 2.5 Pro for an entire workload is a vendor decision. Routing happens inside a workload, per request, against a fixed pool. The dollar savings come from the variance in request difficulty — a workload where every request is uniformly hard has no routing headroom, a workload where 60% of requests are trivial and 40% require frontier reasoning is where routing pays back its overhead the fastest.
Intuition: the Pareto frontier of cost and quality
The mental model that does the most work is the cost-quality Pareto frontier. Plot every candidate model’s average accuracy on your workload against its average dollar cost per request. The frontier is the upper-left convex hull — models that are simultaneously cheaper and better than something else are off the frontier and can be deleted from the pool. The remaining models are only better in one dimension at the expense of the other.
A routing policy is a point inside the frontier, or, with a perfect router, a point on the frontier between two models. With a typical real-world router — accuracy somewhere in the 70–85% range at distinguishing easy from hard queries — the achievable point sits below the frontier connecting the cheapest and most expensive models in the pool, and the gap from the frontier is the router’s “tax”. The closer the router is to a perfect classifier, the closer your achievable point gets to the linear interpolation between the two endpoints.
The cost math that follows directly: suppose the cheap model costs C₁ and the expensive model costs C₂, the fraction of traffic the router sends to the expensive model is p, and the router’s per-request inference cost is r. Expected cost per request is p·C₂ + (1-p)·C₁ + r. With Haiku 4.5 at ~$0.003/request average and Sonnet 4.6 at ~$0.012/request average, a router that pushes 30% to Sonnet costs 0.30·0.012 + 0.70·0.003 + r ≈ $0.0057 + r. Even at r = $0.001 (an embedding lookup plus a logistic regression), the routed cost is $0.0067/request vs. $0.012/request for the always-Sonnet baseline — a 44% reduction. The numbers in the consumer-chat opener are specifically this calculation.
The complementary frame: a router is a load-shedder. Distributed systems shed load by dropping or degrading low-priority requests when capacity tightens; a model router shed-degrades every request to the cheapest tier it can without hurting quality. The mechanism is the same — admission control on a finite budget — but the budget is dollars per request rather than CPU cycles or memory. The same Little’s-Law-style reasoning applies: utilization × variability is the master variable, and the lever is to reduce variability in which model handles which request.
The distributed-systems parallel
The closest analogue is tiered storage with cost-aware placement. A modern object store places hot blocks on NVMe, warm blocks on HDD, cold blocks on tape; the placement policy is driven by a small classifier (recency, access frequency, last-touch timestamp) that runs cheaply enough to pay back its overhead with the storage savings on the bulk of blocks. The LLM analogue maps almost exactly: the router is the placement classifier; the tiers are the model price points; the blocks are the requests; the access frequency features are workload-specific request signatures (query length, presence of code, embedding-space cluster). A miss — a hard request routed to the cheap tier — is the equivalent of a cold-block hit on an HDD: you take a quality penalty, you log the miss, and the policy learns from it.
The deeper parallel is the read-replica / write-master split in a database cluster. Read replicas serve the bulk of read traffic at low cost; the master handles writes and the small slice of reads that need strong consistency. The split exists because most reads don’t need master-level consistency, and pretending they all do means paying master prices for replica work. The LLM equivalent: most requests don’t need flagship-level reasoning, and pretending they all do means paying frontier prices for routine work. The router is the read-router — same primitive, same dollar logic, different layer of the stack.
A real disanalogy that bites. A database read router can verify routing correctness mechanically — if a replica is behind on the binlog, the router knows by a checkable lag metric and falls back to the master. A model router has no equivalent mechanical check on its routing decision. The router that sent a hard query to Haiku doesn’t know the answer was bad until it sees a downstream signal — a user thumbs-down, a low-confidence score from the model itself, a drift detection alert. This is why production routers always run with a cascade fallback (escalate on low confidence) or a shadow eval pipeline (sample some routed-to-cheap requests, re-run them on the expensive tier offline, measure the quality delta), and the eval-driven-development and LLM-as-judge primitives in the curriculum are the supporting infrastructure for both of those.
The three router architectures
Predictive routing: classify upfront, commit
The cheapest router shape. A small classifier — logistic regression on hand-engineered features, a fine-tuned small LM, or a matrix-factorization model over the request and the candidate pool — scores each incoming request and assigns it to a tier in one shot. The classifier runs in milliseconds; the routing decision is final. The training data is pairs of (request, which-model-was-good-enough) collected from offline labeling, LLM-as-judge runs, or user feedback signals from a prior shadow-routed system.
RouteLLM, the open-source framework released by LMSYS in July 2024 and accepted to ICLR 2025, is the canonical worked example. The paper trains four router architectures on Chatbot Arena preference data: similarity-weighted ranking, a matrix-factorization model that decomposes (query, model) → reward, a BERT classifier, and a causal LM classifier. The headline result, reproduced from the paper: the matrix-factorization router achieves 95% of GPT-4’s quality on MT-Bench while routing only 26% of queries to GPT-4 — a 48% cost reduction vs. random baseline at iso-quality. With data augmentation from an LLM judge, the same router hits 95% quality at only 14% strong-model calls — a 75% cost reduction.
Martian and NotDiamond are the commercial heirs of this architecture, with refinements: per-customer custom routers trained on the customer’s own traffic (the NotDiamond router-training quickstart walks through this), routers that account for provider-side outages and latency in addition to quality, and feature engineering that includes embedding distance from training-set clusters. The Martian site claims cost reductions of 20–97% depending on workload; NotDiamond reports a 39% accuracy increase across SRE benchmarks in one published enterprise case study. The honest reading of these numbers is that they’re highly workload-specific — a recent RouterArena benchmark found NotDiamond ranked 12th on LongBench-v2 because its general-purpose router frequently selected expensive models for queries that didn’t need them. Custom routers trained on workload data consistently outperform general-purpose ones; off-the-shelf routers are the right choice when you don’t have labeled traffic yet and the wrong choice once you do.
Cascade routing: start cheap, escalate on low confidence
The shape from the FrugalGPT paper (Chen, Zaharia, Zou, May 2023). Try the cheap model first; check the response against a score function that estimates whether the answer is acceptable; if it passes, return it; if not, escalate to the next tier and repeat. The score function is the hard part — it can be a separate small judge model, the cheap model’s own log-probabilities, a verifier that runs the answer against a known schema, or a simple regex/structured-output check. The original FrugalGPT result: matching GPT-4’s accuracy at 98% cost reduction on the workloads they evaluated, by running queries through a cascade of 12 candidate models with a learned regression-based scoring function.
Cascade routing’s structural advantage over predictive routing is that it gets to see the cheap model’s actual response before deciding. The predictive router commits to a tier on the input alone, which means it has to predict response quality without observing it; the cascade gets to score the realized response, which is a much easier learning problem. This is also cascade routing’s structural disadvantage: every escalation pays both the cheap model’s full cost and the expensive model’s full cost. If 40% of queries escalate, the average cost is 0.6·C₁ + 0.4·(C₁ + C₂) instead of the predictive router’s 0.6·C₁ + 0.4·C₂ — the cascade pays C₁ on every request, escalations included. Cascades win when the cheap model handles a large majority of traffic and the escalation rate stays under ~30%; they lose when the workload is hard enough that escalations dominate.
Cascade routing also has a latency tail that predictive routing doesn’t: every escalation adds a full extra round-trip to the request path. If the cheap model takes 400ms and the expensive model takes 1.5s, escalations land at ~1.9s of end-to-end latency instead of 1.5s. Production cascades therefore tune the cheap-model timeout aggressively (often well below the cheap model’s p99 latency) and accept some “cheap timeout, escalate anyway” cases to keep tail latency bounded.
Content-based routing: heuristics on features you already have
The shape that doesn’t get its own paper but ships first in most teams because the features are free. Route on detected language (English to GPT-5.4-mini, Mandarin to Qwen-Plus), on content type (code → a code-specialized model, prose → a general-purpose one), on detected complexity proxies (token count, presence of math/code/tables), on user tier (free users get the cheap model, paying users get the expensive one). Heuristic content-based routing is what every team starts with and what most teams continue to use as a layer underneath a learned router — the heuristic does the obvious splits and the learned router does the harder per-request calls inside the remaining bucket.
The honest framing: content-based routing is what gets you 60–80% of the routing win at near-zero implementation cost. The marginal headroom from layering a learned router on top is real but typically smaller than the heuristic-only baseline. Build heuristic first, measure, layer learned routing only where the heuristic leaves obvious money on the table.
Mechanics: the cost math worked end-to-end
Concrete numbers as of May 2026, based on the published pricing across providers. The four model tiers most production teams actually route between:
| Tier | Provider/Model | Input $/M | Output $/M | Avg cost / 500-in/300-out request |
|---|---|---|---|---|
| Cheapest | Gemini 2.5 Flash Lite | $0.10 | $0.40 | ~$0.00017 |
| Cheap | Haiku 4.5 | $1.00 | $5.00 | ~$0.002 |
| Mid | GPT-5.4-mini | $0.75 | $4.50 | ~$0.0017 |
| Frontier | Sonnet 4.6 | $3.00 | $15.00 | ~$0.006 |
| Top | Opus 4.7 | $5.00 | $25.00 | ~$0.010 |
| Top | GPT-5.5 | $5.00 | $30.00 | ~$0.011 |
The cost-savings math for an always-Opus baseline (~$0.010/request) routed to a Haiku-mostly policy:
- 70% Haiku + 30% Opus + $0.001 router overhead =
0.70·0.002 + 0.30·0.010 + 0.001 = $0.0054/request, a 46% reduction. - 90% Haiku + 10% Opus + $0.001 =
$0.0029/request, a 71% reduction. - 100% Haiku (no router, accept quality drop) =
$0.002/request, a 80% reduction at unknown quality cost.
The breakeven analysis: a router with overhead r is worth running iff (1-p_strong)·(C_strong - C_weak) > r. Plugging in p_strong = 0.30, C_strong = $0.010, C_weak = $0.002: the savings are 0.70·0.008 = $0.0056/request. Anything less than $5.60 per 1000 requests in router overhead is a net win. At Anthropic Haiku pricing for a typical classifier prompt (~500 input tokens, ~10 output tokens) that’s about $0.0006/request — well under the breakeven, by an order of magnitude. The router pays for itself even before you account for the latency improvement on the routed-to-Haiku majority of traffic.
Two structural numbers it’s worth committing to memory. The 5× rule of Anthropic pricing: output costs 5× input across every tier, so any optimization that reduces output tokens (more concise prompts, schema-constrained generation via structured output, early stopping) has 5× the leverage of the same reduction on input tokens. The 50% batch-API discount: the Anthropic Batch API and OpenAI’s batch endpoint both charge 50% of standard rates for asynchronous, ≤24-hour-latency jobs. Any workload that can tolerate that latency — overnight evals, content generation queues, reflection/consolidation passes, sleep-time compute — runs at half the price card, compounding with prompt caching for a 0.5 × 0.1 = 5% effective rate on cached batch jobs.
Code: a hand-rolled cascade router in Python
The skeleton below implements a two-tier cascade against the Anthropic SDK: try Haiku first, escalate to Sonnet if a self-reported confidence score from Haiku falls below a threshold. The confidence signal here is a simple structured-output score from the model itself — production systems would use a separate judge or learned regressor, but the self-report works as a starting baseline and the math doesn’t change.
| |
Two operational notes. First, the self-confidence score is the simplest possible cascade signal but also the weakest; production cascades typically use either a separate small judge (one LLM-as-judge call on the cheap model’s output) or a learned regressor that takes the cheap response’s log-probs and outputs an accept/escalate score. The self-report version above is fine for getting a cascade running but should be replaced with a learned scorer once you have ~1k labeled (prompt, cheap_response, was_good_enough) tuples. Second, the cascade pays the cheap call’s cost on every request and the expensive call’s cost on escalations — the cost-savings math depends entirely on the escalation rate staying under ~30%. Tune the confidence_floor against your workload.
Code: routing through LiteLLM with fallback and retry
LiteLLM is the open-source proxy/router that most teams reach for once they want OpenAI-compatible routing across multiple providers with fallback, retry, and per-tier configuration. The TypeScript example below uses the LiteLLM proxy as a routing layer — the cascade logic lives in the proxy config; the client just makes an OpenAI-format request to the cheapest deployment and the proxy handles fallback to a stronger model on error or rate-limit:
| |
The LiteLLM pattern is qualitatively different from the hand-rolled cascade above. LiteLLM’s fallback is reliability-driven — it triggers on errors and rate-limits, not on quality scores. The pattern you’d use in production is to stack both: LiteLLM for provider failover (Anthropic Sonnet → OpenAI GPT-5.4 if Anthropic returns a 5xx), and your own application-layer cascade for cost-quality routing (Haiku → Sonnet based on a confidence score). Conflating the two routing layers — using the same router for both reliability and cost optimization — is the most common architectural mistake teams make in this space, because the policies are different (reliability wants fast escalation on any failure; cost-optimization wants to suppress escalation as much as possible) and putting both in the same logic surface creates the worst of both.
OpenRouter is the hosted-proxy alternative: it sits in front of 300+ models, prices each call at the underlying provider’s rate, and falls back automatically on errors. The OpenRouter fallback docs cover the routing knobs — models for explicit fallback chains, sort: "price" for cheapest-available routing, sort: "latency" for latency-sensitive workloads. OpenRouter only bills for the model that actually served the request, which removes the cost-of-failed-attempts overhead that hand-rolled cascade routers pay.
Trade-offs, failure modes, gotchas
Routers don’t compose with prompt caching the way you’d hope. Prompt caching requires routing identical prefixes to the same physical inference machine (the cache is per-machine, not global). A router that sends some of a user’s requests to Haiku and others to Sonnet writes the user’s system prompt to two cache namespaces, paying cache-write prices on both. The fix is to route at a coarser granularity — at the session level rather than per-turn — so each session is sticky to one model and the cache amortizes across the session’s turns. Routers that decide per-turn destroy cache hit rate; routers that decide per-session preserve it.
The escalation tail dominates p99 latency in cascades. A predictive router has roughly uniform latency (router decision + chosen model’s call); a cascade pays the cheap model’s full latency on every escalation, then the expensive model’s full latency on top. If 25% of traffic escalates, the p75 latency is the cheap-model latency but the p90+ is cheap + expensive. The opening example’s 80ms p99 increase from the router is the predictive-router case; a cascade with the same routing rate would land closer to 500ms+ in the p90, dominated by the escalations. This is why every production cascade I’ve seen ships with an aggressive cheap-model timeout (300-800ms typically) and a “timeout means escalate” rule rather than waiting for the cheap model’s natural p99.
The cheap model’s failures are correlated with the hard requests. A naive estimate of cascade savings treats easy and hard requests as equally distributed across the workload; in practice, the hard requests cluster — particular topics, specific user segments, certain time-of-day patterns. A cheap model’s misses are concentrated, not Poisson-distributed. If your cascade is hitting the published savings number on average but blowing up on a particular customer segment, the cause is concentration: that segment is on the hard tail. Per-segment routing policies — train a separate router on each major customer or topic cluster — are the standard fix.
Models retire faster than routers retrain. Your router was trained on Sonnet 4.5 and now Sonnet 4.6 is the default. The new model’s strengths and weaknesses are different — it’s better at code but slightly worse at adversarial reasoning, say — and your router’s predictions are now miscalibrated. The half-life of a learned router against a fast-moving model lineup is shorter than the half-life of most production ML models. Teams that don’t have a continuous-retraining loop set up will silently lose router quality on every provider model upgrade. The drift-detection article in this curriculum walks through the alerting infrastructure for catching this; the operational discipline is: re-evaluate routing decisions every time a candidate model changes, treat the router itself as a model with its own eval suite and a regression-test on each release.
The “always escalate on tool calls” gotcha. Tool-use workloads have a hidden cost structure: the cheap model makes the tool call, the result comes back, and the loop continues with the cheap model — but if the cheap model misuses the tool (wrong arguments, missing a required field, looping on a failed call), the cascade has to escalate mid-loop. Mid-loop escalation is harder than pre-call escalation because the conversation state has to be preserved across the tier change, and not every model handles the same tool schema identically (Anthropic’s tool-use format and OpenAI’s function-calling format differ enough that switching providers mid-conversation requires normalization). Practical rule: if your workload is tool-call-heavy, route at the start of the conversation and stay on that tier; don’t try to mix tiers within a multi-turn tool-call loop.
The “router as single point of failure” gotcha. A learned router is an inference call that has to succeed for every downstream request. If the router model fails or times out, the system either has to fall back to a default tier (typically the expensive one, eliminating the cost saving for that request) or fail the request entirely. Production routers should have a static fallback policy — “if the router can’t decide in N ms, send to the strong model” — and that policy’s cost should be amortized into the cost model. The 99.9% reliability of a routing service costs you 0.1% of requests at the strong-model price; the cost math has to include that.
Further reading from the field
- RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing — LMSYS, July 2024 — the canonical announcement post for the open-source RouteLLM framework; walks through the four router architectures and the cost-savings numbers reproduced above.
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance — Chen, Zaharia, Zou, 2023 — the cascade-routing paper. Three orthogonal strategies (prompt adaptation, LLM approximation, LLM cascade) with a worked benchmark showing 98% cost reduction at GPT-4 quality.
- Building an LLM Router for High-Quality and Cost-Effective Responses — Anyscale, 2024 — a hands-on guide to deploying a learned router in production, with discussion of the train/eval loop and the per-segment-policy pattern.
- Why Accenture and Martian see model routing as key to enterprise AI success — VentureBeat, 2025 — the enterprise-adoption framing for routing, including the air-traffic-control mental model and the 37%-of-enterprises-use-5+-models statistic.
What to read next
- Fine-Tuning vs RAG: When to Choose Which — the next lever after routing: when the workload is uniform enough that routing can’t shed it, the choice becomes whether to change the model or change the prompt.
- Speculative Decoding and Draft Models — the server-side cost-and-latency optimization that complements application-level routing.
- Prompt Caching: Reusing the KV Cache Across Calls — the optimization that routing has to play nicely with (route at session granularity, not per-turn).
- Quantization and Distillation: Compression for Inference — the curriculum closer. Distillation is how the small-model tier in a routing architecture gets good — fine-tune a Haiku-class student on traces from a Sonnet/Opus teacher and the cheap tier inherits a chunk of the expensive tier’s behavior on the workload’s distribution.