Speculative Decoding and Draft Models

Speculative decoding with draft models, Medusa, EAGLE, and prompt lookup for lower decode latency.

Jatin Bansal@blog:~/ai-engineering$ open speculative-decoding

Speculative decoding can improve time per output token in an idle benchmark and reduce throughput under a heavily batched workload. Its value depends on draft acceptance, verification cost, KV-cache pressure, and scheduler load. Measurements therefore need the production batch-size distribution, not only single-request latency.

Acceptance and speedup

Three numbers govern whether speculation pays off, in the framing from Leviathan et al.:

α (alpha): the acceptance rate. The probability that a single drafter-proposed token survives verification.
γ (gamma): the speculation length. How many tokens the drafter proposes per verification step.
c: the cost ratio. The drafter’s per-token time divided by the target’s per-token time. A 1B-parameter drafter against a 70B target has c ≈ 0.02–0.05; Medusa heads have c ≈ 0; EAGLE drafters have c in the 0.02–0.10 range depending on tree size.

The expected number of accepted tokens per speculation step is approximately (1 - α^(γ+1)) / (1 - α). The expected cost per speculation step is approximately γ·c + 1 target-step-equivalents (γ drafter steps plus one verification). The speedup over baseline decoding is the ratio: tokens-per-step divided by cost-per-step.

A worked example. Suppose α = 0.7, γ = 4, c = 0.05. Expected accepted tokens per step is (1 - 0.7⁵)/(1 - 0.7) ≈ 2.77. Expected cost is 4·0.05 + 1 = 1.20 target steps. Speedup is 2.77 / 1.20 ≈ 2.31×. Plug in α = 0.55 instead and you get (1 - 0.55⁵)/(1 - 0.55) ≈ 2.11 tokens at cost 1.20, a speedup of 1.76×. Drop α to 0.4 and the speedup falls to 1.37×: still positive, but small enough that the KV-cache reservation overhead in the batch scheduler can wipe it out. In this idealized model the breakeven for γ = 4 and c = 0.05 sits near α ≈ 0.17, but the model ignores scheduling and memory overhead, so the practical breakeven is considerably higher. Below α ≈ 0.5 the remaining margin shrinks quickly once those overheads are counted; published vLLM acceptance-rate measurements suggest production deployments target α ≥ 0.65, ideally α ≥ 0.75.
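The arithmetic in the worked example can be reproduced with a few lines of Python. This is a minimal sketch of the idealized model only: the function names `expected_tokens` and `speedup` are illustrative, and the model assumes every drafted token is accepted independently with the same probability α.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per speculation step: (1 - a^(g+1)) / (1 - a)."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, c: float) -> float:
    """Idealized speedup: tokens per step divided by cost in target-step units."""
    cost = gamma * c + 1.0  # gamma drafter steps plus one verification pass
    return expected_tokens(alpha, gamma) / cost


if __name__ == "__main__":
    for alpha in (0.7, 0.55, 0.4):
        print(f"alpha={alpha:.2f}  tokens/step={expected_tokens(alpha, 4):.2f}  "
              f"speedup={speedup(alpha, 4, 0.05):.2f}x")
```

The model leaves out scheduler and KV-cache overhead, so measured speedups on a loaded server will generally be lower than what it predicts.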

The proposal length γ has an optimum because each additional token is less likely to be accepted. Practical values often sit at 3–7 for autoregressive drafters and 2–4 for Medusa-style heads. Acceptance α also depends on workload: a drafter can perform well on uniform chat traffic and worse on high-entropy code generation.
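To see where that optimum falls, the same idealized speedup formula can be swept over γ. The sketch below assumes independent per-token acceptance at rate α; `best_gamma` and the search bound of 16 are illustrative choices, not values from any serving framework.

```python
def idealized_speedup(alpha: float, gamma: int, c: float) -> float:
    """(1 - alpha^(gamma+1)) / (1 - alpha), divided by gamma*c + 1. Requires alpha < 1."""
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha) / (gamma * c + 1.0)


def best_gamma(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Return the speculation length in 1..max_gamma with the highest idealized speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: idealized_speedup(alpha, g, c))


if __name__ == "__main__":
    for alpha in (0.5, 0.6, 0.7, 0.8, 0.9):
        g = best_gamma(alpha, c=0.1)
        print(f"alpha={alpha:.1f}  best gamma={g}  "
              f"speedup={idealized_speedup(alpha, g, 0.1):.2f}x")
```

Higher acceptance pushes the optimum toward longer drafts, and a more expensive drafter pulls it back; the curve is fairly flat near its peak, so a value within one or two of the computed optimum costs little.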

Draft-and-verify decoding

The original algorithm in pseudocode:

1. Drafter M_q generates γ tokens autoregressively: x_1, …, x_γ, with per-token probabilities q(x_i | prefix, x_<i).
2. Target M_p runs one forward pass on the concatenation [prefix, x_1, …, x_γ], producing target probabilities p(· | prefix, x_<i) for each position i ∈ {1, …, γ+1}.
3. For each i from 1 to γ: accept x_i with probability min(1, p(x_i | ·) / q(x_i | ·)); on rejection, sample a correction token from the normalized residual distribution (p − q)⁺ = max(0, p − q), append the accepted prefix + correction, and stop.
4. If all γ tokens were accepted, sample one additional token from p(· | prefix, x_1, …, x_γ) (the “bonus token”, free from the same forward pass) and continue.

The modified rejection sampling in step 3 is what guarantees output-distribution equivalence. The bonus token in step 4 is why the expected accepted-tokens formula is (1 − α^(γ+1)) / (1 − α) and not (1 − α^γ) / (1 − α): when speculation lands all γ tokens, you get a free (γ+1)th token from the target’s own predictions. This is also why γ has a sweet spot rather than a monotonic improvement curve: extending the draft from γ − 1 to γ tokens adds only α^γ expected tokens, because the new slot matters only when every earlier draft was accepted, while it still costs one more drafter step c.
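The draft-and-verify steps described above can be sketched in plain Python over toy distributions. This is an illustration under stated assumptions, not a serving implementation: `draft_dist` and `target_dist` are hypothetical callables that map a token context to a probability list, and the target is called γ+1 times here, where a real system would get the same probabilities from a single batched forward pass.

```python
import random


def sample(dist, rng):
    """Draw one token index from a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def residual(p, q):
    """Normalized max(0, p - q); falls back to p if the residual is empty."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff] if total > 0 else list(p)


def speculative_step(prefix, draft_dist, target_dist, gamma, rng):
    """Run one draft-and-verify step and return the tokens it emits."""
    # Drafter proposes gamma tokens autoregressively.
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(gamma):
        q = draft_dist(ctx)
        x = sample(q, rng)
        drafted.append(x)
        q_dists.append(q)
        ctx.append(x)

    # Target probabilities at each of the gamma + 1 positions.
    p_dists = [target_dist(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # Accept each draft with probability min(1, p/q); correct and stop on rejection.
    out = []
    for i, x in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
        else:
            out.append(sample(residual(p, q), rng))
            return out

    # Every draft accepted: take the bonus token from the target.
    out.append(sample(p_dists[gamma], rng))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    target = lambda ctx: [0.6, 0.3, 0.1]   # toy context-independent target
    drafter = lambda ctx: [0.3, 0.3, 0.4]  # toy drafter that disagrees with it
    print(speculative_step([], drafter, target, gamma=4, rng=rng))
```

Each step emits between 1 and γ+1 tokens. Running many steps and tallying the first emitted token recovers the target distribution rather than the drafter’s, which is the equivalence property the rejection rule is designed to give.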

The drafter is typically a much smaller model from the same model family: TinyLlama (1.1B) as drafter for Llama-70B, Qwen2.5-0.5B as drafter for Qwen2.5-72B, Llama-3.2-1B as drafter for Llama-3.1-70B. Sharing the tokenizer is required (the drafter and target must agree on what x_i means as an integer ID). Sharing the training data distribution matters a lot for α: drafters trained on corpora similar to the target’s consistently outperform off-the-shelf small models, which is why production deployments increasingly use trained-for-speculation drafters (EAGLE, Medusa) rather than independently trained small models.

Medusa

Medusa (Cai et al., ICML 2024) removes the separate drafter model entirely by adding extra prediction heads to the target itself. The recipe: take a frozen target model, add k small feedforward heads on top of the last hidden layer, and train head j (for j = 1, …, k) to predict the token j positions beyond the one the regular LM head predicts. At inference time, one forward pass produces k+1 token predictions (the regular next-token plus the k Medusa-head predictions); a tree-attention verifier then validates a tree of candidate continuations (the top-N completions from each head) in one additional forward pass.

Medusa heads are cheap relative to a target-model forward pass. Tree attention verifies several candidate continuations in parallel with a mask that prevents cross-candidate contamination. Its typical-acceptance rule is softer than rejection sampling and can introduce small distributional bias in exchange for better acceptance at higher temperature.
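The mask that keeps candidates from seeing each other can be illustrated with a small helper. This is a sketch of the general idea only, not Medusa’s kernel: the candidate tree is encoded as a hypothetical `parents` list, and each node is allowed to attend to itself and its ancestors (in a real verifier every node also attends to the full shared prefix, which is omitted here).

```python
def tree_attention_mask(parents):
    """Build a boolean mask for a candidate-token tree.

    parents[i] is the index of node i's parent, or -1 if node i hangs
    directly off the shared prefix. mask[i][j] is True when node i may
    attend to node j, that is, when j is i itself or one of its ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


if __name__ == "__main__":
    # Two first-position candidates (nodes 0 and 1); node 0 has two
    # continuations (2 and 3) and node 1 has one (4).
    parents = [-1, -1, 0, 0, 1]
    for row in tree_attention_mask(parents):
        print("".join("x" if allowed else "." for allowed in row))
```

Because sibling branches never appear in each other’s rows, all candidates can be packed into one sequence and scored in a single forward pass without one branch influencing another.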

What Medusa pays for these properties: the heads have to be trained per target model, they consume a small amount of extra GPU memory (typically <5% of model size), and they interact with batch scheduling, because the tree-attention kernel has to be re-instantiated per batch shape. The Medusa GitHub repo documents this as one of the harder integration points.

EAGLE

EAGLE takes Medusa’s “use the target’s own representations” insight and runs further with it. Instead of adding token-prediction heads, EAGLE adds a single small autoregressive decoder that consumes the target’s penultimate-layer hidden states (its features) and predicts the next several features, which then go through the target’s own LM head to produce token probabilities. Two structural advantages over Medusa: the feature-level decoder can be trained to handle multi-step prediction more cleanly than independent token heads, and the LM head is shared with the target (no extra parameters for the vocabulary projection).

EAGLE-2 added dynamic tree depth (adjusting the speculation tree size per token based on confidence), which improved α by 0.05–0.10 on standard benchmarks. EAGLE-3 changed the feature target from penultimate-layer hidden states to a learned combination of multiple intermediate layers, which gave another acceptance-rate bump on long-context workloads. EAGLE-3.1, released on the vLLM blog on May 26, 2026, added robustness improvements that the team reports as up to 2× longer acceptance length compared to EAGLE-3, meaning the average number of accepted tokens per speculation step roughly doubles on the workloads they benchmarked. The vLLM integration is mature enough that you load an EAGLE drafter with a single config flag; the example here uses EAGLE-3 weights:

```text
--speculative-config '{"model":"yuhuili/EAGLE3-LLaMA-3.1-Instruct-8B","method":"eagle3","num_speculative_tokens":3}'
```

The configuration constraint worth knowing: EAGLE drafters need draft_tensor_parallel_size: 1, even when the target model is tensor-parallel across multiple GPUs. The EAGLE drafter is small enough that splitting it across GPUs costs more in communication than it saves in compute. The vLLM speculative decoding docs walk through the configuration options in depth; the Red Hat developer post has measured throughput numbers on Llama-3.1-70B that match what production deployments report.
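One way to keep that constraint from being forgotten is to generate the flag rather than hand-edit the JSON. The helper below is a hypothetical convenience, not part of vLLM; it assumes `draft_tensor_parallel_size` is accepted as a key in the same JSON object as the other speculative settings, which should be confirmed against the documentation for the installed version.

```python
import json
import shlex


def speculative_config_flag(drafter: str, method: str = "eagle3",
                            num_speculative_tokens: int = 3,
                            draft_tp: int = 1) -> str:
    """Build a shell-safe --speculative-config argument as a single string."""
    config = {
        "model": drafter,
        "method": method,
        "num_speculative_tokens": num_speculative_tokens,
        # Assumption: this key sits alongside the others in the JSON object.
        "draft_tensor_parallel_size": draft_tp,
    }
    payload = json.dumps(config, separators=(",", ":"))
    return "--speculative-config " + shlex.quote(payload)


if __name__ == "__main__":
    print(speculative_config_flag("yuhuili/EAGLE3-LLaMA-3.1-Instruct-8B"))
```

Generating the argument with `shlex.quote` avoids the quoting mistakes that tend to creep in when JSON is pasted into a shell command by hand.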

Prompt lookup and suffix decoding

The other end of the spectrum: don’t train a drafter at all. Prompt-lookup decoding (Apoorv Saxena’s original implementation, now integrated into Hugging Face Transformers as prompt_lookup_num_tokens and into vLLM as the ngram speculative method) treats the prompt and the generation-so-far as a corpus. On each speculation step it searches that corpus for the most recent earlier occurrence of the last n tokens, then proposes the k tokens that followed that occurrence. No model, no training, no extra GPU memory: just a string search.
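The lookup itself fits in a dozen lines. The following is a minimal sketch of the idea rather than the Hugging Face or vLLM implementation; `prompt_lookup`, `ngram_size`, and `num_draft` are illustrative names, and production versions typically try several n-gram sizes and use faster matching.

```python
def prompt_lookup(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    tokens is the prompt plus everything generated so far, as token IDs.
    Returns up to num_draft tokens that followed the most recent earlier
    occurrence of the last ngram_size tokens, or [] when there is no match.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan right to left so the most recent earlier occurrence wins; the
    # starting index excludes the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            end = start + ngram_size
            return tokens[end:end + num_draft]
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 9, 1, 2, 3]
    print(prompt_lookup(context, ngram_size=3, num_draft=2))
```

The proposed tokens then go through the same verification step as any other draft, so a bad match costs a wasted verification slot but never changes the output.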

The intuition: workloads with high self-repetition (RAG-grounded generation where the answer quotes the context, code generation with repeated identifiers, summarization that lifts phrases from the source, agent loops that re-emit boilerplate) have near-deterministic next-token distributions over short windows. Prompt lookup catches exactly that pattern. On RAG and code workloads it routinely hits α = 0.6–0.8; on free-form chat (where there’s no corpus to look up against) it falls to α = 0.2–0.3 and provides no speedup.

Suffix decoding is the next refinement: build a trie over the prompt’s suffixes so that a lookup costs time proportional to the length of the matched pattern rather than the length of the prompt, which makes longer γ practical. The headline number from the Snowflake Arctic paper: 1.4–3.9× faster than n-gram speculation on the same workloads. vLLM exposes prompt lookup as the ngram speculative method; whether suffix decoding is available, and under which method name, depends on the installed version. The advantage of these methods over model-based drafters is operational: there’s nothing to train, nothing to deploy, nothing to keep in sync with model upgrades. The disadvantage is the workload dependence: outside the self-repetitive cases they degrade gracefully but provide no benefit.
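A depth-bounded suffix trie shows why the lookup no longer scans the prompt. This is a simplified sketch, not the Snowflake implementation: `build_suffix_trie`, `propose`, and the depth limits are illustrative, and the draft is chosen by greedily following the most frequent continuation.

```python
def _new_node():
    return {"count": 0, "kids": {}}


def build_suffix_trie(tokens, max_depth=8):
    """Insert every suffix of tokens, truncated to max_depth, with visit counts."""
    root = _new_node()
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:start + max_depth]:
            node = node["kids"].setdefault(tok, _new_node())
            node["count"] += 1
    return root


def propose(root, context, num_draft=4, max_match=4):
    """Match the longest suffix of context in the trie, then follow frequent children."""
    for n in range(min(max_match, len(context)), 0, -1):
        node = root
        for tok in context[-n:]:
            node = node["kids"].get(tok)
            if node is None:
                break
        else:
            # Matched an n-token suffix: walk down the most frequent path.
            draft = []
            while node["kids"] and len(draft) < num_draft:
                tok, node = max(node["kids"].items(), key=lambda kv: kv[1]["count"])
                draft.append(tok)
            if draft:
                return draft
    return []


if __name__ == "__main__":
    corpus = [1, 2, 3, 4, 5, 1, 2, 3, 4, 6, 1, 2, 3, 4, 5]
    trie = build_suffix_trie(corpus)
    print(propose(trie, [9, 2, 3], num_draft=2))
```

Each lookup touches at most `max_match` tokens per attempt regardless of corpus size; the price is the memory and build time for the trie, which grow with corpus length times `max_depth`.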

Benchmarking EAGLE-3 with vLLM

The code below runs against a vLLM server with EAGLE-3 enabled. Launch the server:

```bash
# pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"model":"yuhuili/EAGLE3-LLaMA-3.1-Instruct-8B","method":"eagle3","num_speculative_tokens":3}'
```

Then drive it from the OpenAI Python client (vLLM serves an OpenAI-compatible API, so any compatible client works) and measure both wall-clock TPOT and acceptance rate. vLLM exposes acceptance metrics via Prometheus, which is what the script scrapes; a per-request signal may also be recoverable from the chat completion’s usage.spec_decode_metrics field on V1, depending on the installed version.

```python
# pip install openai requests
import os
import statistics
import time
from dataclasses import dataclass

import requests
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("OPENAI_BASE_URL", "http://localhost:8000/v1"),
    api_key="not-needed-for-local",
)

VLLM_METRICS_URL = os.environ.get("VLLM_METRICS_URL", "http://localhost:8000/metrics")

# Counter names differ between vLLM versions; confirm them against /metrics.
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTED = "vllm:spec_decode_num_draft_tokens_total"
EMITTED = "vllm:spec_decode_num_emitted_tokens_total"


@dataclass
class SpecTrace:
    e2e_ms: float
    tpot_ms: float
    output_tokens: int


def parse_metrics() -> dict[str, float]:
    """Scrape vLLM's Prometheus metrics and return the speculative-decoding counters."""
    resp = requests.get(VLLM_METRICS_URL, timeout=2)
    resp.raise_for_status()
    out: dict[str, float] = {}
    for line in resp.text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        # Sample line: vllm:spec_decode_num_accepted_tokens_total{...} 12345
        name_end = line.find("{") if "{" in line else line.find(" ")
        name = line[:name_end]
        if name in {ACCEPTED, DRAFTED, EMITTED}:
            # Sum across label sets in case the server reports several series.
            value = float(line.rsplit(" ", 1)[1])
            out[name] = out.get(name, 0.0) + value
    return out


def measure_one(prompt: str, max_tokens: int = 256) -> SpecTrace:
    """Drive one streaming request and report wall-clock TPOT + e2e."""
    t0 = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
        stream_options={"include_usage": True},
    )

    first_token = None
    last_token = None
    output_tokens = 0
    for event in stream:
        if event.choices and event.choices[0].delta.content:
            now = time.perf_counter()
            if first_token is None:
                first_token = now
            last_token = now
        if event.usage:
            # The final chunk carries usage when include_usage is set.
            output_tokens = event.usage.completion_tokens

    t_end = time.perf_counter()
    if first_token is None or last_token is None:
        raise RuntimeError("no tokens emitted")
    if output_tokens <= 0:
        raise RuntimeError("server did not report usage; cannot compute TPOT")

    # TPOT is the first-to-last chunk span divided by the (output_tokens - 1) gaps.
    # With a single output token there are no gaps, so fall back to e2e / output_tokens.
    e2e_ms = (t_end - t0) * 1000
    if output_tokens > 1:
        tpot_ms = (last_token - first_token) * 1000 / (output_tokens - 1)
    else:
        tpot_ms = e2e_ms / output_tokens
    return SpecTrace(e2e_ms=e2e_ms, tpot_ms=tpot_ms, output_tokens=output_tokens)


def run_acceptance_benchmark(
    prompts: list[str], n_each: int = 20, num_speculative_tokens: int = 3
) -> dict[str, float]:
    """Measure TPOT per request and acceptance rate over the whole run.

    Acceptance is accepted_tokens / draft_tokens from counter deltas. The counters
    are server-wide, so run this against an otherwise idle server.
    num_speculative_tokens must match the value in the launch config.
    """
    m_start = parse_metrics()
    traces: list[SpecTrace] = []
    for prompt in prompts:
        for _ in range(n_each):
            traces.append(measure_one(prompt))
    m_end = parse_metrics()

    def delta(name: str) -> float:
        return m_end.get(name, 0.0) - m_start.get(name, 0.0)

    accepted, drafted, emitted = delta(ACCEPTED), delta(DRAFTED), delta(EMITTED)

    alpha = accepted / drafted if drafted > 0 else float("nan")
    # Each speculation step drafts num_speculative_tokens tokens. Report NaN
    # when a counter is missing instead of a misleading zero.
    steps = drafted / num_speculative_tokens
    tokens_per_step = emitted / steps if steps > 0 and emitted > 0 else float("nan")
    tpots = [t.tpot_ms for t in traces]
    return {
        "alpha": alpha,
        "mean_tokens_per_step": tokens_per_step,
        "tpot_p50_ms": statistics.median(tpots),
        "tpot_mean_ms": statistics.mean(tpots),
    }


if __name__ == "__main__":
    # Two workloads with different self-repetition profiles.
    code_prompt = (
        "Refactor the following Python function to use a dict comprehension:\n\n"
        "def to_lookup(items):\n"
        "    out = {}\n"
        "    for item in items:\n"
        "        out[item.id] = item.value\n"
        "    return out\n"
    )
    chat_prompt = "Explain how speculative decoding interacts with continuous batching."

    print("-- code workload --")
    print(run_acceptance_benchmark([code_prompt], n_each=20))
    print("-- chat workload --")
    print(run_acceptance_benchmark([chat_prompt], n_each=20))
```

The parse_metrics helper uses vLLM’s Prometheus endpoint and therefore depends on the metric names in the installed version. Acceptance rates far below the expected workload baseline usually indicate a tokenizer or drafter-target training mismatch.

References

Fast Inference from Transformers via Speculative Decoding; Leviathan, Kalman, Matias (Google, ICML 2023). The foundational paper that introduced the modified-rejection-sampling formulation and proved output-distribution equivalence to the target. Read this first.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads; Cai et al. (ICML 2024). The drafter-free approach that grafts prediction heads onto the target, with typical-acceptance as the relaxed acceptance criterion. The Medusa GitHub repo has the reference implementation.
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty; Li et al. (2024). The feature-level autoregressive drafter that became the production default; EAGLE-2/3/3.1 are progressive refinements documented in subsequent papers and the EAGLE GitHub repo.
Speculative Decoding: Performance or Illusion?; a 2026 measurement study of speculative decoding in vLLM across workloads, model scales, and batch sizes. The strongest empirical treatment of when speculation actually pays off in production-shaped traffic.