Speculative Decoding and Draft Models
Draft-and-verify decoding: how speculative sampling, Medusa, EAGLE-3, and ngram methods turn one forward pass into many tokens — and when it pays.
A team flips on speculative decoding in their self-hosted vLLM cluster on a Tuesday afternoon, sees a clean 2.3× TPOT improvement in the staging benchmark, ships it, and watches throughput at the 18:00 traffic peak drop 12%. The acceptance rate looks fine on the dashboard. Tail TPOT is worse, not better. The fix is one config line — disable speculation above a batch-size threshold — but the underlying lesson is the one this article exists to make legible: speculative decoding is a contract between the inference engine and the batch scheduler, and the contract terms change with load. Every production deployment of speculative decoding in 2026 is a tuning exercise around that contract.
Opening bridge
Yesterday’s article on inference latency ended on a foreshadowing: speculative decoding interacts with batching in subtle ways, and the speedup you measure at low load doesn’t survive contact with a saturated scheduler. This piece picks that thread up. Inference latency, in the prefill-then-decode model laid out yesterday, has two bottlenecks: the compute-bound prefill (which chunked prefill and prompt caching attack) and the memory-bandwidth-bound decode (which batching attacks). Speculative decoding is the second class of attack on the decode bottleneck — instead of amortizing the model-weight read across more concurrent users, it amortizes it across more tokens per user per step. Same HBM bandwidth, more tokens out the other end. When it works, it’s a free 2–3× on TPOT for low-concurrency workloads; when it doesn’t, it’s a regression that’s invisible on most dashboards.
Definition
Speculative decoding lets a single target-model forward pass emit more than one token: a fast drafter proposes a k-token candidate continuation, the slow target model scores all k positions in parallel, and the longest prefix that passes the acceptance test is kept. The drafter is anything that can cheaply guess the next few tokens: a small autoregressive model (the original “Fast Inference from Transformers via Speculative Decoding” formulation by Leviathan, Kalman, and Matias at Google, ICML 2023), extra prediction heads grafted onto the target (Medusa, Cai et al., ICML 2024), a feature-level drafter that consumes the target’s penultimate-layer hidden states (EAGLE, Li et al., 2024, extended through EAGLE-2, EAGLE-3, and the EAGLE 3.1 release on May 26, 2026), or even just an n-gram lookup over the prompt and recent generations (prompt-lookup decoding). The verifier is always the original target model, run once per speculation step on the k drafted tokens at roughly the cost of one ordinary decode step.
The crucial property, the one that makes speculation safe to ship in front of users, is that under exact verification the accepted output distribution is mathematically identical to ordinary autoregressive sampling from the target. Leviathan’s formulation uses modified rejection sampling on the drafter/target probability ratio; Medusa relaxes this to typical acceptance, which admits any candidate whose target probability exceeds an entropy-derived threshold. The first scheme is provably output-equivalent to the target; the second is approximately equivalent and trades a small distribution shift for a meaningfully higher acceptance rate. Either way, the system semantics are the same: speculation either delivers tokens the target has approved faster, or it falls back to ordinary decoding for that step. It never silently substitutes a cheap model’s tokens for the expensive one’s.
Intuition: branch prediction for the decode loop
The cleanest mental model is branch prediction with rollback. A modern superscalar CPU doesn’t wait for each instruction to retire before issuing the next one — it speculatively executes down a predicted branch, and if the prediction misses, it flushes the pipeline and pays the misprediction penalty. Speculative decoding does the same thing at the token level. The drafter predicts the next k tokens; the target verifies the prediction in one parallel forward pass; on a hit (the drafter’s tokens are what the target would have sampled anyway) you ship k tokens for the cost of one step; on a miss you ship the accepted prefix plus the target’s correction token and pay the cost of the drafter’s wasted work. The economics work as long as the drafter is cheap enough and the acceptance rate is high enough that the average tokens-per-step is meaningfully greater than one.
The complementary frame is the read-amplification trick. A 70B-class model at FP16 reads ~140 GB of weights from HBM on every decode step, plus the growing KV cache for the active sequence. That HBM read is the dominant cost of decode; the FLOPs to actually compute one new token’s output are negligible by comparison. A speculative-decoding verification step also reads ~140 GB, but it produces logits for k positions, not one. If the drafter got 4 of those positions right, the target shipped 4× the tokens for ~1× the HBM bandwidth. This is why speculative decoding is fundamentally a memory-bandwidth-amplification trick at the GPU level: it doesn’t speed up the model, it makes each model-weight read amortize over more tokens. The orthogonal lever is quantization, which shrinks the HBM read itself (INT4 takes the 140 GB down to ~35 GB), and the two stack: speculation amortizes the read across more tokens, quantization shrinks the read, and a well-tuned production deployment uses both.
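The read-amplification arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch only: the helper name hbm_gb_per_token is made up for illustration, and it deliberately ignores KV-cache reads and activation traffic.

```python
def hbm_gb_per_token(weight_gb: float, tokens_per_step: float) -> float:
    """Weight bytes read from HBM per emitted token (KV-cache reads ignored)."""
    return weight_gb / tokens_per_step


# 70B parameters at 2 bytes (FP16) and at 0.5 bytes (INT4), expressed in GB.
fp16_gb = 70e9 * 2 / 1e9
int4_gb = 70e9 * 0.5 / 1e9

for label, weights in (("FP16", fp16_gb), ("INT4", int4_gb)):
    for tokens in (1, 4):
        per_token = hbm_gb_per_token(weights, tokens)
        print(f"{label}, {tokens} token(s) per step: {per_token:.2f} GB read per token")
```

The two levers multiply: shrinking the read and spreading it over more tokens both reduce the bytes moved per token that reaches the user.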
A useful sanity check: the math degrades fast at high batch size. At batch 1, decode is decisively memory-bandwidth-bound, and speculative decoding’s read-amplification trick is pure gain. At batch 16+, decode is approaching compute-bound (many sequences share each weight read, so the per-sequence bandwidth cost is already amortized), and speculation’s extra verification work starts to compete for the same compute units that were already saturated. The vLLM blog’s benchmarks consistently show 2–3× speedups at batch 1–10 and either marginal gains or net regressions above that. This is the contract whose terms change with load.
The math: when speculation is a win
Three numbers govern whether speculation pays off, in the framing from Leviathan et al.:
- α (alpha) — the acceptance rate. The probability that a single drafter-proposed token survives verification.
- γ (gamma) — the speculation length. How many tokens the drafter proposes per verification step.
- c — the cost ratio. The drafter’s per-token time divided by the target’s per-token time. A 1B-parameter drafter against a 70B target has c ≈ 0.02–0.05; Medusa heads have c ≈ 0; EAGLE drafters have c in the 0.02–0.10 range depending on tree size.
The expected number of accepted tokens per speculation step is approximately (1 - α^(γ+1)) / (1 - α). The expected cost per speculation step is approximately γ·c + 1 target-step-equivalents (γ drafter steps plus one verification). The speedup over baseline decoding is the ratio: tokens-per-step divided by cost-per-step.
A worked example. Suppose α = 0.7, γ = 4, c = 0.05. Expected accepted tokens per step is (1 - 0.7⁵)/(1 - 0.7) ≈ 2.77. Expected cost is 4·0.05 + 1 = 1.20 target steps. Speedup is 2.77 / 1.20 ≈ 2.31×. Plug in α = 0.55 instead and you get (1 - 0.55⁵)/(1 - 0.55) ≈ 2.11 tokens at cost 1.20, speedup 1.76×. That is still a win on paper, but the formula ignores scheduler and KV-cache overheads, and α ≈ 0.55 is about where those overheads make speculation breakeven in practice. Drop α to 0.4 and the idealized speedup collapses to 1.37×: still positive, but small enough that the KV-cache reservation overhead in the batch scheduler can wipe it out. Below α ≈ 0.5 the math gets ugly fast; published vLLM acceptance-rate measurements suggest production deployments target α ≥ 0.65, ideally α ≥ 0.75.
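For readers who want to plug in their own numbers, the formula is small enough to script. A minimal sketch, using the α, γ, c definitions above; the function names are illustrative, and the model is the idealized one (no scheduler or KV-cache overhead).

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per speculation step, bonus token included."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def step_cost(gamma: int, c: float) -> float:
    """Cost of one speculation step in target-step-equivalents."""
    return gamma * c + 1.0


def speedup(alpha: float, gamma: int, c: float) -> float:
    """Idealized speedup over plain decoding: tokens per step / cost per step."""
    return expected_tokens(alpha, gamma) / step_cost(gamma, c)


for alpha in (0.7, 0.55, 0.4):
    tokens = expected_tokens(alpha, 4)
    print(f"alpha={alpha}: {tokens:.2f} tokens/step, speedup {speedup(alpha, 4, 0.05):.2f}x")
```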
Two structural consequences of this formula. First, γ has an optimum, not a monotonic curve. Bigger γ proposes more tokens but each marginal token is harder to accept (the joint probability that all of them are right decays geometrically with γ), while every extra slot still costs a drafter step. Practical γ in production sits at 3–7 for autoregressive drafters and 2–4 for Medusa-style heads. Second, α depends on the workload, not just on the model pairing. A drafter that gets α = 0.8 on code-completion traffic might drop to α = 0.55 on open-ended chat traffic, because free-form prose has higher token-level entropy than code. The same draft model on the same target model can have a 25-point swing in acceptance rate across two workloads, which is why every framework now ships per-workload acceptance dashboards.
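The existence of an optimum in γ is easy to see by sweeping it. The sketch below assumes the same idealized speedup model as the formula above; best_gamma and its max_gamma cap are illustrative names, not part of any framework.

```python
def speedup(alpha: float, gamma: int, c: float) -> float:
    """Idealized speedup: expected tokens per step over cost per step."""
    tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return tokens / (gamma * c + 1.0)


def best_gamma(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Speculation length that maximizes the idealized speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))


for alpha in (0.5, 0.7, 0.9):
    g = best_gamma(alpha, c=0.1)
    print(f"alpha={alpha}, c=0.1: best gamma={g}, speedup {speedup(alpha, g, 0.1):.2f}x")
```

Higher acceptance rates and cheaper drafters both push the optimum toward longer speculation; a real deployment would also have to price in KV-cache pressure, which this model leaves out.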
The distributed-systems parallel
The closest analogue is read-ahead and speculative I/O in a kernel filesystem. When you read() a small chunk of a file, the kernel doesn’t fetch only that chunk from disk — it speculatively reads ahead, fetching the next several blocks into page cache on the assumption that you’ll ask for them next. If your access pattern is sequential, the speculation pays off and subsequent reads are cache hits; if it’s random, the read-ahead is wasted I/O and the kernel disables it. Speculative decoding is exactly the same trick at a different layer: prefetch tokens optimistically, validate them against the source of truth, fall back to the slow path on a miss.
The deeper parallel is two-phase commit with the drafter as the prepare phase. The drafter “prepares” a multi-token transaction by committing tentatively to a continuation; the target either commits the whole prefix in one shot or commits a shorter prefix plus a correction. This is the same shape as a Saga where the compensating action is trivial (drop the rejected tokens). The reason speculative decoding can run in front of customer traffic without disclaimers — the same reason 2PC can run safely under user-facing systems — is that the commit point preserves the same semantics as the synchronous version. The target’s output distribution is unchanged; only its latency is.
A real disanalogy that bites. CPU branch prediction’s misprediction cost is bounded by the pipeline depth (10–20 cycles on a modern x86); speculative decoding’s misprediction cost is bounded by γ drafter steps, and even a fully rejected draft still yields one correct token from the target. Because each miss is cheap relative to the payoff of a hit, the misprediction rate you can tolerate is much higher than what a branch predictor can afford: modern branch predictors hit 95–99% accuracy; speculative decoding needs only ~70%. Different operating points on a similar-shape trade-off. The other disanalogy: a CPU’s branch predictor learns from runtime behavior (BTB, history tables, hashed PC); most production speculative decoders are static. The drafter weights are fixed at deployment, and acceptance rates drift over time as your traffic mix changes. Online learning for drafters is an active research area (see Arctic Inference) but not yet the production default.
Mechanics: vanilla speculative decoding (Leviathan)
The original algorithm in pseudocode:
1. Drafter M_q generates γ tokens autoregressively: x_1, …, x_γ, with per-token probabilities q(x_i | prefix, x_<i).
2. Target M_p runs one forward pass on the concatenation [prefix, x_1, …, x_γ], producing target probabilities p(· | prefix, x_<i) for each position i ∈ {1, …, γ+1}.
3. For each i from 1 to γ: accept x_i with probability min(1, p(x_i | ·) / q(x_i | ·)); on rejection, sample a correction token from the residual distribution (p − q)⁺, append the accepted prefix + correction, and stop.
4. If all γ tokens were accepted, sample one additional token from p(· | prefix, x_1, …, x_γ) (the “bonus token”, free from the same forward pass) and continue.
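The accept/reject loop in these steps can be sketched in plain Python. This is an illustration under simplifying assumptions, not a production kernel: distributions are plain lists of probabilities over a toy vocabulary, the drafter and target forward passes are assumed to have already produced q_dists and p_dists, and the names verify and sample are made up for the example.

```python
import random


def sample(dist, rng):
    """Draw a token id from a (possibly unnormalized) probability vector."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def verify(drafted, q_dists, p_dists, rng):
    """Return the tokens to append after one speculation step.

    drafted: the gamma token ids proposed by the drafter
    q_dists: gamma drafter distributions; drafted[i] was sampled from q_dists[i]
    p_dists: gamma + 1 target distributions from the single verification pass
    """
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pj - qj) for pj, qj in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # Every draft accepted: take the bonus token from the last target distribution.
    out.append(sample(p_dists[len(drafted)], rng))
    return out


rng = random.Random(0)
q, p = [0.8, 0.2], [0.3, 0.7]
print(verify([sample(q, rng)], [q], [p, p], rng))
```

Running the last two lines many times and counting the first returned token is a quick empirical check that the output follows p rather than q.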
The modified rejection sampling in step 3 is what guarantees output-distribution equivalence. The bonus token in step 4 is why the expected accepted-tokens formula is (1 − α^(γ+1)) / (1 − α) and not (1 − α^γ) / (1 − α): when speculation lands all γ tokens, you get a free (γ+1)th token from the target’s own predictions. The formula also shows why γ has a sweet spot rather than a monotonic improvement curve: the j-th drafted slot contributes only α^j expected tokens, yet it always costs one more drafter step.
The drafter is typically a much smaller model from the same model family: TinyLlama (1.1B) as drafter for Llama-2-70B, Qwen2.5-0.5B as drafter for Qwen2.5-72B, Llama-3.2-1B as drafter for Llama-3.1-70B. Sharing the tokenizer is required (the drafter and target must agree on what x_i means as an integer ID). Sharing the training data distribution matters a lot for α. Drafters trained on similar corpora to the target consistently outperform off-the-shelf small models, which is why production deployments increasingly use trained-for-speculation drafters (EAGLE, Medusa) rather than independently-trained small models.
Mechanics: Medusa heads
Medusa (Cai et al., ICML 2024) removes the separate drafter model entirely by adding extra prediction heads to the target itself. The recipe: take a frozen target model, add k small feedforward heads on top of the last hidden layer, and train head j to predict the token j+1 positions past the current one (the base LM head already predicts the token one position ahead). At inference time, one forward pass produces k+1 token predictions (the regular next-token plus the k Medusa-head predictions); a tree-attention verifier then validates a tree of candidate continuations (the top-N completions from each head) in one additional forward pass.
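To make the candidate tree concrete, here is a toy sketch of how top-N proposals from each head expand into tree nodes, together with the ancestor-only attention mask that keeps candidates from seeing each other. It is a simplification for illustration: the function names are invented, the full Cartesian product is enumerated, and the root token from the base LM head is left out.

```python
from itertools import product


def build_tree(top_per_head):
    """top_per_head[h] holds the top-N token ids proposed by head h.

    Returns every node of the candidate tree as a tuple path of token ids,
    shallowest nodes first.
    """
    nodes = []
    for depth in range(1, len(top_per_head) + 1):
        nodes.extend(product(*top_per_head[:depth]))
    return nodes


def tree_attention_mask(nodes):
    """mask[i][j] is True when node j is node i itself or one of its ancestors."""
    return [[a[:len(b)] == b for b in nodes] for a in nodes]


nodes = build_tree([[1, 2], [3, 4]])   # two heads, top-2 each
mask = tree_attention_mask(nodes)
for node, row in zip(nodes, mask):
    print(node, ["x" if allowed else "." for allowed in row])
```

With k heads and N proposals each, the deepest level has N^k leaves, which is why real implementations prune the tree instead of enumerating it in full.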
Three structural properties make Medusa attractive in practice. (1) c ≈ 0: the Medusa heads are negligibly cheap compared to a full target forward pass, so the γ·c drafter term in the cost formula collapses to ~0 and the speedup becomes tokens-per-step / 1. (2) Tree-attention parallel verification: by verifying up to N^k candidate continuations in one forward pass (with a custom attention mask that prevents cross-candidate contamination), Medusa raises the effective α through brute-force enumeration of likely paths. (3) Typical acceptance: Medusa uses a softer acceptance criterion than rejection sampling, admitting any candidate whose target probability exceeds a threshold tied to the entropy of the target distribution. The trade-off is a small bias relative to true target sampling; the benefit is that high-temperature workloads (where rejection sampling can collapse to near-zero acceptance) become viable.
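The typical-acceptance rule in (3) can be sketched as follows. The form of the threshold, min(ε, δ·exp(−H(p))), follows the Medusa paper as I understand it; the default values for epsilon and delta below are illustrative assumptions rather than recommended settings.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def typical_accept(p_target, token, epsilon=0.09, delta=0.3):
    """Accept a candidate if its target probability clears an entropy-scaled bar.

    A peaked (low-entropy) target distribution yields a high bar; a flat
    (high-entropy) one yields a low bar, so more candidates pass.
    """
    threshold = min(epsilon, delta * math.exp(-entropy(p_target)))
    return p_target[token] > threshold


peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.01] * 100
print(typical_accept(peaked, 0), typical_accept(peaked, 1), typical_accept(flat, 0))
```

The same 1% token is rejected under the peaked distribution and accepted under the flat one, which is the behavior that keeps acceptance usable at high temperature, at the price of no longer sampling exactly from the target.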
What Medusa pays for these properties: the heads have to be trained per target model, they consume a small amount of extra GPU memory (typically <5% of model size), and they’re stateful in a way that interacts with batch scheduling — the tree-attention kernel has to be re-instantiated per batch shape, which the Medusa github repo documents as one of the harder integration points.
Mechanics: EAGLE family (1, 2, 3, 3.1)
EAGLE takes Medusa’s “use the target’s own representations” insight and runs further with it. Instead of adding token-prediction heads, EAGLE adds a single small autoregressive decoder that consumes the target’s penultimate-layer hidden states (its features) and predicts the next several features, which then go through the target’s own LM head to produce token probabilities. Two structural advantages over Medusa: the feature-level decoder can be trained to handle multi-step prediction more cleanly than independent token heads, and the LM head is shared with the target (no extra parameters for the vocabulary projection).
EAGLE-2 added dynamic tree depth, adjusting the speculation tree size per token based on confidence, which improved α by 0.05–0.10 on standard benchmarks. EAGLE-3 changed the feature target from penultimate-layer hidden states to a learned combination of multiple intermediate layers, which gave another acceptance-rate bump on long-context workloads. EAGLE-3.1, released on the vLLM blog on May 26, 2026, added robustness improvements that the team reports as up to 2× longer acceptance length compared to EAGLE-3, meaning the average number of accepted tokens per speculation step can as much as double on the workloads they benchmarked. The vLLM integration is mature enough that you load EAGLE-3.1 weights with a single config flag:
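As a rough sketch of what such a configuration looks like, the snippet below assembles the JSON value for vLLM’s --speculative-config option. The key names follow the vLLM speculative-decoding documentation as I understand it; the method string appropriate for EAGLE-3.1, the draft-weights path, and the speculation length of 4 are all assumptions or placeholders, so check them against the docs for your vLLM version.

```python
import json

# Assumed key names and values; the model entry is a placeholder, not a real checkpoint.
speculative_config = {
    "method": "eagle3",
    "model": "<path-or-hub-id-of-eagle-draft-weights>",
    "num_speculative_tokens": 4,
    "draft_tensor_parallel_size": 1,
}

# Print the flag in the form it would be passed on the server command line.
print("--speculative-config '" + json.dumps(speculative_config) + "'")
```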
The configuration constraint worth knowing: EAGLE drafters need draft_tensor_parallel_size: 1, even when the target model is tensor-parallel across multiple GPUs. The EAGLE drafter is small enough that splitting it across GPUs costs more in communication than it saves in compute. The vLLM speculative decoding docs walk through the configuration options in depth; the Red Hat developer post has measured throughput numbers on Llama-3.1-70B that match what production deployments report.
Mechanics: prompt-lookup and suffix decoding
The other end of the spectrum: don’t train a drafter at all. Prompt-lookup decoding (Apoorv Saxena’s original implementation, now integrated into Hugging Face Transformers as prompt_lookup_num_tokens and into vLLM as the ngram speculative method) treats the prompt and the generation-so-far as a corpus. On each speculation step it searches that corpus for the most recent earlier occurrence of the last few generated tokens (an n-gram ending at the current token), then proposes the k tokens that followed that occurrence. No model, no training, no extra GPU memory, just a string search.
The intuition: workloads with high self-repetition (RAG-grounded generation where the answer quotes the context, code generation with repeated identifiers, summarization that lifts phrases from the source, agent loops that re-emit boilerplate) have near-deterministic next-token distributions over short windows. Prompt-lookup catches exactly that pattern. On RAG and code workloads it routinely hits α = 0.6–0.8; on free-form chat (where there’s no corpus to look up against) it falls to α = 0.2–0.3 and provides no speedup.
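The lookup itself fits in a dozen lines. This is a minimal sketch of the idea rather than the Transformers or vLLM implementation: it works on a flat list of token ids covering prompt plus generation, and the parameter names ngram_size and num_draft are chosen for the example.

```python
def prompt_lookup(tokens, ngram_size=3, num_draft=4):
    """Propose up to num_draft tokens by matching the trailing n-gram earlier in tokens.

    Scans right to left so the most recent earlier occurrence wins, and returns
    an empty list when the trailing n-gram has not appeared before.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Start one position before the tail so the tail never matches itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow_from = start + ngram_size
            return tokens[follow_from:follow_from + num_draft]
    return []


history = [5, 6, 7, 8, 9, 1, 5, 6, 7]
print(prompt_lookup(history))             # drafts the tokens that followed 5, 6, 7 last time
print(prompt_lookup([1, 2, 3, 4], 2, 4))  # no earlier match, so nothing is drafted
```

An empty draft simply means the step falls back to ordinary decoding, which is why the method is harmless, though not helpful, on non-repetitive traffic.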
Suffix decoding is the next refinement: build a trie over the prompt’s suffixes so the lookup cost scales with the length of the match rather than the length of the prompt, enabling longer γ. The headline number from the Snowflake Arctic paper: 1.4–3.9× faster than n-gram speculation on the same workloads. vLLM supports both methods natively as the ngram speculative configuration. The advantage of these methods over model-based drafters is operational: there’s nothing to train, nothing to deploy, nothing to keep in sync with model upgrades. The disadvantage is the workload-dependence: outside the self-repetitive cases they degrade gracefully but provide no benefit.
Code: measuring acceptance rate and TPOT against a vLLM EAGLE-3 server
The code below runs against a vLLM server with EAGLE-3 enabled. Launch the server:
Then drive it from the OpenAI Python client (or any OpenAI-compatible client) and measure both wall-clock TPOT and acceptance rate. vLLM exposes acceptance metrics via Prometheus, but the per-request signal is recoverable from the chat completion’s usage.spec_decode_metrics field on V1.
Two operational notes. First, the parse_metrics helper scrapes the Prometheus endpoint — easier than parsing vLLM’s structured logs but tied to the metric names vLLM ships in V1; check /metrics on your version. Second, the code workload should land α ≈ 0.7+ and the chat workload α ≈ 0.5–0.6 in practice on EAGLE-3; if your numbers are far below those, the drafter/target pairing is mismatched (most often: tokenizer mismatch, or the drafter wasn’t trained on the same instruction-tuning data).
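A metrics helper of this kind might look like the sketch below. It is a stand-in written for illustration: it parses Prometheus text that has already been fetched from /metrics, and the two counter names are assumptions about what vLLM V1 exposes, so confirm them against your server’s output. The acceptance rate is computed from the difference between two scrapes, one before and one after the benchmark run.

```python
# Assumed metric names; verify against the /metrics output of your vLLM version.
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTED = "vllm:spec_decode_num_draft_tokens_total"


def parse_metrics(text, names):
    """Sum each named metric across all label sets in Prometheus text format."""
    totals = {name: 0.0 for name in names}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[:line.index("{")]
            value_part = line[line.rindex("}") + 1:]
        else:
            name, _, value_part = line.partition(" ")
        if name in totals:
            totals[name] += float(value_part.split()[0])
    return totals


def acceptance_rate(before_text, after_text):
    """Accepted / drafted tokens between two scrapes of the metrics endpoint."""
    before = parse_metrics(before_text, [ACCEPTED, DRAFTED])
    after = parse_metrics(after_text, [ACCEPTED, DRAFTED])
    drafted = after[DRAFTED] - before[DRAFTED]
    if drafted <= 0:
        return float("nan")
    return (after[ACCEPTED] - before[ACCEPTED]) / drafted
```

Scraping once before and once after a workload, then calling acceptance_rate on the two payloads, gives a per-workload α without touching the server logs.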
Code: a TypeScript load generator that compares speculation on vs off
The harness below measures TPOT under two server configurations — vanilla decoding and EAGLE-3 — and computes the realized speedup at the same concurrency level. Run it twice against two vLLM instances on different ports, or against the same instance with /admin/disable-spec-decode toggled if your version supports it.
What this harness is designed to expose: speculation’s gain is not uniform across concurrency. At concurrency=1 you’ll typically see a clean 2–3× TPOT improvement with EAGLE-3 on. At concurrency=4 the gain shrinks to maybe 1.4–1.8×. At concurrency=16 the gain often vanishes or inverts, because the batch scheduler can’t reserve KV-cache slots for γ speculative tokens per sequence times 16 sequences without evicting in-flight work. The right operational answer is adaptive speculation: turn it on when the engine has spare KV-cache capacity, off when it doesn’t. vLLM doesn’t yet ship this as a flag; the production pattern is either two server pools (a speculation-enabled pool for low-concurrency endpoints, a vanilla pool for high-throughput batch) or a per-request disable_spec_decode hint set by a routing layer that knows the current load.
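The two-pool pattern reduces to a small routing decision. The sketch below is hypothetical throughout: the EngineLoad fields, the pool names, and both thresholds are placeholders that a real gateway would replace with values tuned from its own load tests.

```python
from dataclasses import dataclass


@dataclass
class EngineLoad:
    running_sequences: int   # sequences currently being decoded
    kv_cache_usage: float    # fraction of KV-cache blocks in use, 0.0 to 1.0


def pick_pool(load: EngineLoad, max_spec_batch: int = 8, max_spec_cache: float = 0.85) -> str:
    """Route to the speculation-enabled pool only while it has headroom."""
    has_batch_headroom = load.running_sequences <= max_spec_batch
    has_cache_headroom = load.kv_cache_usage < max_spec_cache
    if has_batch_headroom and has_cache_headroom:
        return "speculative"
    return "vanilla"


print(pick_pool(EngineLoad(running_sequences=2, kv_cache_usage=0.40)))
print(pick_pool(EngineLoad(running_sequences=16, kv_cache_usage=0.60)))
```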
Trade-offs, failure modes, gotchas
The acceptance-rate dashboard lies if you only look at the mean. α drifts with workload, and means hide bimodal distributions. A deployment serving both code-completion (α ≈ 0.8) and chat (α ≈ 0.55) at 50/50 shows mean α = 0.675 on the dashboard, which looks healthy, while half the traffic is sitting at the practical break-even threshold. Bucket the metric by route, model, prompt-length quartile, and time-of-day before you trust it. If you’re using vLLM’s Prometheus metrics, label them with route and prompt_length_bucket at the gateway and aggregate downstream.
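A small aggregation makes the point about means concrete. The sketch assumes the gateway can emit one (bucket, accepted tokens, drafted tokens) record per request or per scrape interval; the record shape and the function name are illustrative.

```python
from collections import defaultdict


def acceptance_by_bucket(samples):
    """samples: iterable of (bucket, accepted_tokens, drafted_tokens) records."""
    counts = defaultdict(lambda: [0, 0])
    for bucket, accepted, drafted in samples:
        counts[bucket][0] += accepted
        counts[bucket][1] += drafted
    return {b: a / d for b, (a, d) in counts.items() if d > 0}


samples = [("code", 800, 1000), ("chat", 550, 1000)]
overall = sum(a for _, a, _ in samples) / sum(d for _, _, d in samples)

print(f"overall alpha: {overall:.3f}")
for bucket, alpha in acceptance_by_bucket(samples).items():
    print(f"{bucket}: {alpha:.2f}")
```

The overall figure looks comfortable while the chat bucket alone does not, which is the pattern a single dashboard number hides.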
KV-cache pressure is the silent throughput killer. Every in-flight sequence reserves slots in the KV cache; when speculation is on, the per-step reservation grows from one new slot per sequence to (γ+1), to leave room for the speculative tokens that might get accepted. On a server already running near KV capacity, enabling speculation can cut the max concurrent sequences in half. The signal in vLLM logs is the cache_usage gauge climbing past ~85% during peak; when that happens, speculation should be turning off, not staying on. The BentoML production guide walks through how they handle this in practice: adaptive γ tied to current cache utilization.
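One way to tie γ to cache utilization is a simple linear back-off. This is a hypothetical policy written to illustrate the idea, not the BentoML implementation: the function name, the 70% start point, and the 85% cutoff are all assumed values.

```python
def adaptive_gamma(cache_usage: float, max_gamma: int = 5,
                   start: float = 0.70, cutoff: float = 0.85) -> int:
    """Shrink the speculation length as KV-cache usage climbs.

    Full gamma below `start`, no speculation at or above `cutoff`,
    and a linear ramp (never below 1) in between.
    """
    if cache_usage >= cutoff:
        return 0
    if cache_usage <= start:
        return max_gamma
    headroom = (cutoff - cache_usage) / (cutoff - start)
    return max(1, round(max_gamma * headroom))


for usage in (0.50, 0.72, 0.80, 0.84, 0.90):
    print(f"cache usage {usage:.0%}: gamma = {adaptive_gamma(usage)}")
```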
Drafter/target divergence on long contexts. EAGLE and Medusa drafters are typically trained on context lengths in the 4k–32k range. At 128k+ context, the drafter’s feature predictions degrade well before the target’s do, and α drops sharply. The DeepSeek-V3 paper’s Multi-Token Prediction (MTP) module is one response to this — training the speculation module jointly with the target on the same long-context curriculum, so the divergence at long contexts is bounded. If your traffic has a long tail of long-context requests and you’re seeing α tank on those, swap in MTP or an equivalent long-context-trained drafter rather than tuning γ.
Temperature interacts with rejection sampling badly. Vanilla speculative decoding’s modified rejection sampling stays exact at any temperature, and with top-k/top-p cutoffs, as long as the same sampling transform is applied to the distributions before the accept/reject test; what suffers is the acceptance rate. At temperature ≫ 1, the drafter and target distributions can diverge enough that acceptance collapses. Medusa’s typical-acceptance scheme handles this better but is an approximation. If your workload uses high temperature deliberately (creative writing, brainstorming), speculation will either drop α to unusable levels or shift the output distribution. Pick your poison. Most production deployments either disable speculation when temperature > 0.8 or use Medusa-style typical acceptance with awareness of the bias.
Speculation interacts oddly with structured-output constraints. When grammar-constrained decoding is on (JSON schema, regex, FSM), the drafter doesn’t know about the constraints; it proposes tokens that the verifier then rejects whenever they violate the grammar. The effective α can drop to single-digit percentages because the drafter is essentially making blind guesses against a grammar it can’t see. The standard fix is to share the grammar mask with the drafter, so that the same FSM that prunes the target’s logits also prunes the drafter’s. This works for simple grammars but adds latency and complexity for tree-shaped ones.
Speculation isn’t free on TTFT. Yesterday’s article drew the prefill/decode split. Speculation accelerates decode but adds drafter-initialization overhead to the first speculation step after prefill. The penalty is small (1–2 ms on EAGLE-3, near-zero on Medusa) but consistent. If your workload is short-output-dominated (output_tokens ≤ ~20), speculation’s overhead can outweigh its benefit. Measure end-to-end latency including the short-output bucket, not just TPOT in steady state.
Multi-batched speculation is the next frontier. P-EAGLE and similar 2026 work attack the batch-degradation problem directly by speculating across multiple sequences in parallel — sharing the drafter’s compute across the batch the way the target already shares its weights. Production-grade as of mid-2026 in some setups but not yet a vLLM default. If you’re hitting the concurrency ceiling on EAGLE-3 today, this is the lever to track.
Further reading from the field
- Fast Inference from Transformers via Speculative Decoding — Leviathan, Kalman, Matias (Google, ICML 2023). The foundational paper that introduced the modified-rejection-sampling formulation and proved output-distribution equivalence to the target. Read this first.
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Cai et al. (ICML 2024). The drafter-free approach that grafts prediction heads onto the target, with typical-acceptance as the relaxed acceptance criterion. The Medusa github repo has the reference implementation.
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Li et al. (2024). The feature-level autoregressive drafter that became the production default; EAGLE-2/3/3.1 are progressive refinements documented in subsequent papers and the EAGLE github repo.
- Speculative Decoding: Performance or Illusion? — a 2026 measurement study of speculative-decoding in vLLM across workloads, model scales, and batch sizes. The strongest empirical treatment of when speculation actually pays off in production-shaped traffic.
What to read next
- Cost Optimization and Model Routing — the next article in the Production & Operations subtree. Where speculation cuts the cost of each call, routing cuts the cost of which call you make; together they’re the two dominant levers in the same dollar budget.
- Inference Latency: Prefill, Decode, and Batching — the prerequisite under this article. Continuous batching and chunked prefill define the scheduling regime that speculation interacts with; the gain math here only makes sense against the prefill/decode split laid out there.
- Quantization and Distillation: Compression for Inference — the orthogonal compression lever. Speculation amortizes the per-step HBM read across multiple tokens; quantization shrinks the read itself. Production deployments use both, and the gains stack rather than compete.
- Prompt Caching: Reusing the KV Cache Across Calls — the prefill-side cousin of this article’s decode-side optimization. Cached prefill + speculative decode together attack both phases of the inference call.