Inference Latency: Prefill, Decode, and Batching

Inference latency across prefill, decode, batching, chunked prefill, and disaggregated serving.

Jatin Bansal@blog:~/ai-engineering$ open inference-latency

Self-hosting changes who owns the inference scheduler. The same model can show a better median and a much worse p99 when long prompts share batches with short, actively decoding conversations. Understanding that result requires separating queueing, prefill, and decode, then measuring how the server schedules each phase.

Definition

Inference latency for an LLM is the wall-clock time a request spends inside the serving stack. It decomposes into three sequential phases (queueing, prefill, and decode) and is measured by three metrics that capture different parts of the user experience: TTFT, TPOT, and end-to-end latency. Queueing time is how long the request waits before the scheduler picks it up. Prefill time is the compute-bound forward pass over the input prompt that produces the KV cache and the first output token. Decode time is the memory-bandwidth-bound loop that autoregressively generates each subsequent token. The metrics: TTFT (time-to-first-token) equals queueing plus prefill; this is the latency the user perceives until streaming starts. TPOT (time-per-output-token), sometimes called ITL (inter-token latency), is the average gap between consecutive decoded tokens; this is the latency the user perceives during streaming. End-to-end latency is approximately TTFT plus TPOT × output token count; this is the latency a non-streaming caller sees. A 200-token answer at TTFT = 200 ms and TPOT = 30 ms has e2e = 200 ms + 200 × 30 ms = 6,200 ms = 6.2 s. TTFT dominates the perceived experience for streaming UIs, and decode dominates the cost.
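The relationships between the metrics can be written down directly. The sketch below is illustrative only: the `RequestTiming` class and the 50 ms / 150 ms split between queueing and prefill are assumptions chosen so that TTFT comes out at the 200 ms used in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    """Hypothetical container for one request's phase timings."""
    queue_ms: float      # time spent waiting for the scheduler
    prefill_ms: float    # forward pass over the prompt
    tpot_ms: float       # mean gap between output tokens
    output_tokens: int   # number of generated tokens

    @property
    def ttft_ms(self) -> float:
        # TTFT = queueing + prefill
        return self.queue_ms + self.prefill_ms

    @property
    def e2e_ms(self) -> float:
        # Same approximation as the text: TTFT + TPOT x output tokens.
        return self.ttft_ms + self.tpot_ms * self.output_tokens


# Assumed split: 50 ms queueing + 150 ms prefill = 200 ms TTFT.
example = RequestTiming(queue_ms=50.0, prefill_ms=150.0, tpot_ms=30.0, output_tokens=200)
print(example.ttft_ms, example.e2e_ms)  # 200.0 6200.0
```

Keeping queueing and prefill as separate fields makes it easy to see which phase moved when TTFT regresses.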

The three phases respond to different optimizations. Queueing time falls when the scheduler can fit a new request into an in-flight batch, which is the whole point of continuous batching. Prefill time falls when a stable prefix is already cached on the GPU (prompt caching), and the stall that a long prefill imposes on other requests falls when the server splits the prompt into chunks that coexist with decode steps (chunked prefill). Decode cost per token falls when each read of the weights from HBM is shared by more concurrent decodes in the same batch, which means more memory pressure on the KV cache, which is why PagedAttention exists. The dials interact, and a well-tuned server is one where TTFT and TPOT trade off cleanly against batch size and total throughput rather than collapsing into the bimodal worst case described in the opener.

Mechanics: prefill vs decode, what each phase does

The split is mechanical. A request arrives with prompt_tokens = [t₁, t₂, …, tₙ]. The server runs a single forward pass over all n tokens in parallel; every token attends to every prior token, producing the K/V tensors for each layer. This is prefill. At the end, the server samples the (n+1)th token from the model’s output distribution. Total work is O(n²) in the attention layers (each token attends to all earlier tokens) and O(n) in the FFN layers; modern GPUs saturate compute around a prompt length of a few thousand tokens for a 70B-class model. TTFT is dominated by this single forward pass plus any queueing time before the scheduler admitted the request.

After prefill the request enters the decode loop. Each iteration feeds the previously sampled token through one forward pass, where the new token’s query attends to the entire stored K/V cache (no recompute needed), samples the next token from the output distribution, and appends the new token’s K/V entries to the cache. Each step touches one query token against a growing K/V state; the FLOPs are small, but the memory traffic (reading the model weights and the full K/V cache from HBM on every step) is the bottleneck. A 70B-class model at FP16 reads ~140 GB of weights per step; on an H100 with ~3 TB/s of HBM bandwidth, the theoretical floor for TPOT is ~50ms per token for one sequence. The whole point of batching is to amortize that 140 GB read across many concurrent decodes: at batch size 16, one step takes about as long as before but emits 16 tokens, so the bandwidth cost per generated token is 1/16th, roughly 3ms of GPU time per token. Each individual sequence still sees a TPOT near the single-sequence floor; what improves is throughput and cost. This is why a busy inference server with many concurrent users is cheaper per token than an idle one. (Quantization attacks the same equation from the other side: an INT4-quantized 70B model reads ~35 GB per step instead of 140 GB, dropping the single-sequence TPOT floor by 4×.)
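The bandwidth arithmetic above is easy to reproduce. This is a back-of-the-envelope sketch, not a performance model: the function names are made up for illustration, and it deliberately ignores K/V cache reads, kernel overheads, and multi-GPU sharding.

```python
def decode_step_floor_ms(weight_gb: float, hbm_tb_per_s: float) -> float:
    """Lower bound on one decode step: stream every weight from HBM once."""
    seconds = weight_gb / (hbm_tb_per_s * 1000.0)  # TB/s -> GB/s
    return seconds * 1000.0


def amortized_ms_per_token(step_ms: float, batch_size: int) -> float:
    """GPU time per generated token when one step emits batch_size tokens."""
    return step_ms / batch_size


fp16_step = decode_step_floor_ms(weight_gb=140, hbm_tb_per_s=3.0)
int4_step = decode_step_floor_ms(weight_gb=35, hbm_tb_per_s=3.0)

print(f"FP16 step floor: {fp16_step:.1f} ms")  # 46.7 ms
print(f"FP16 amortized at batch 16: {amortized_ms_per_token(fp16_step, 16):.1f} ms")  # 2.9 ms
print(f"INT4 step floor: {int4_step:.1f} ms")  # 11.7 ms
```

The step floor is what a single sequence experiences between tokens; the amortized figure is the cost per generated token across the whole batch.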

A consequence that bites: the optimal batch shape changes between prefill and decode, and the scheduler has to manage both. A naive batch of “everyone’s prefill plus everyone’s decode” stalls because the prefill phase dominates GPU time and the decode tokens pile up waiting. The 2022 Orca paper called this the static-batching failure mode and observed that throughput could improve by an order of magnitude when batching happened at the iteration level: the scheduler picks which sequences participate in each forward pass independently, with prefills and decodes coexisting only when their compute profiles allow it.

Mechanics: continuous batching (the Orca primitive)

Iteration-level scheduling, the technique Orca introduced and every modern inference engine implements, makes one scheduling decision per forward pass rather than one per request. At the start of each iteration the scheduler picks a batch of sequences to run in this forward pass: each sequence has its current K/V state, its next-token-to-process (a prompt token for prefilling sequences, a previously-sampled token for decoding sequences), and a free slot in the batch tensor. The forward pass runs. Each sequence’s output is appended to its K/V state; decoding sequences also emit one sampled token. Then the scheduler picks the next iteration’s batch, and that batch can be different: a newly-arrived request that wants to start prefilling can join, a sequence that just hit EOS leaves, an in-flight long-output decode continues.

The vocabulary that grew up around this: continuous batching (the marketing name, popularized by Anyscale’s 2023 blog post on 23× throughput), iteration-level batching (Orca’s term), in-flight batching (NVIDIA’s term in TensorRT-LLM). They mean the same thing. The contrast is with static batching, where a batch is formed once at admission and runs to completion before the next batch starts; static batching wastes GPU cycles whenever any sequence in the batch finishes early, and it forces newly arrived requests to wait for the entire batch to drain before they can start prefilling.
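The contrast between static and iteration-level batching can be made concrete with a toy simulation. The sketch below is a simplification under stated assumptions: every request is already queued at time zero, each forward pass costs one time unit regardless of batch contents, prefill is ignored, and every output length is at least 1. The function names are illustrative and do not come from any engine.

```python
from collections import deque


def static_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every forward pass."""
    waiting = deque(output_lens)
    running: list[int] = []  # tokens still to generate per in-flight sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one forward pass emits one token per running sequence
        # Sequences that just produced their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


lens = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lens, batch_size=4))      # 200
print(continuous_batching_steps(lens, batch_size=4))  # 110
```

In the static case the three short sequences in each batch sit idle for 90 steps waiting on the long one; in the continuous case their slots are reused as soon as they finish.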

What continuous batching doesn’t solve on its own is the head-of-line blocking problem when a long prefill enters the batch. A single 50k-token prefill takes ~500ms-2s of GPU time (model- and hardware-dependent); during that forward pass, every in-flight decode sees its next token delayed by roughly that amount. TPOT for users mid-conversation spikes from 30ms to 1000ms+. This is the bimodal latency mode described in the opener. The fix is the next primitive.

Mechanics: chunked prefill (the Sarathi primitive)

Chunked prefill splits a long prompt’s prefill into many smaller forward passes (chunks of, say, 512-2048 tokens) and piggybacks decode steps into the same forward pass as each prefill chunk. The technique came from the 2023 SARATHI paper and was extended into a production scheduler in Sarathi-Serve at OSDI '24. Each scheduled iteration carries a budget of max_num_batched_tokens (say 2048), and the scheduler fills that budget greedily: in-flight decodes first (one token each), then as many new prefill chunks as fit. A 50k-token prefill becomes roughly 25 iterations of up to 2048 tokens each, and during all of those iterations the in-flight decodes are still making progress: one token per iteration, no stalling.
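The greedy budget fill described above can be sketched in a few lines. This is a toy model, not vLLM’s or Sarathi-Serve’s scheduler: `PrefillJob`, `plan_iteration`, and `iterations_to_prefill` are hypothetical names, and the sketch ignores KV-cache capacity, preemption, and the moment a finished prefill turns into a decode.

```python
from dataclasses import dataclass


@dataclass
class PrefillJob:
    name: str
    prompt_left: int  # prompt tokens still waiting to be prefilled


def plan_iteration(decoding: list[str], prefilling: list[PrefillJob], budget: int) -> dict[str, int]:
    """Tokens scheduled per sequence for one forward pass under a token budget."""
    plan: dict[str, int] = {}
    # Decodes go first: one token each, so they never stall behind a prefill.
    for name in decoding[:budget]:
        plan[name] = 1
    budget -= len(plan)
    # Whatever budget remains goes to prefill chunks, in arrival order.
    for job in prefilling:
        chunk = min(job.prompt_left, budget)
        if chunk == 0:
            continue
        plan[job.name] = chunk
        job.prompt_left -= chunk
        budget -= chunk
    return plan


def iterations_to_prefill(prompt_tokens: int, n_decodes: int, budget: int) -> int:
    """Count forward passes until one long prompt is fully prefilled."""
    if n_decodes >= budget:
        raise ValueError("no budget left for prefill chunks")
    job = PrefillJob("long", prompt_tokens)
    decoding = [f"chat-{i}" for i in range(n_decodes)]
    steps = 0
    while job.prompt_left > 0:
        plan_iteration(decoding, [job], budget)
        steps += 1
    return steps


first = plan_iteration(["chat-0", "chat-1"], [PrefillJob("long", 5000)], budget=2048)
print(first)  # {'chat-0': 1, 'chat-1': 1, 'long': 2046}
print(iterations_to_prefill(50_000, n_decodes=8, budget=2048))  # 25
```

With 8 in-flight decodes, each iteration carries 2040 prefill tokens, so the 50k-token prompt still finishes in 25 iterations while every chat sequence emits one token per iteration.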

The trade-off is direct. A smaller max_num_batched_tokens (e.g. 2048) gives better TPOT: fewer prefill tokens per iteration means decodes don’t get drowned out, and inter-token latency stays low. A larger max_num_batched_tokens (e.g. 8192–16384) gives better TTFT: more prefill tokens per iteration means the prefilling sequence reaches its first token sooner. The vLLM docs frame this as: small for ITL-sensitive workloads (chat, coding assistants), large for TTFT-sensitive workloads (single-question UX, batch document processing). vLLM V1 enables chunked prefill by default; the Hugging Face TGI --max-batch-prefill-tokens and --waiting-served-ratio knobs control the same trade-off under different names. Anthropic’s June 2024 piggyback batching post (joint with Databricks Mosaic) reports 2-3× throughput improvements over un-chunked vLLM at SLO-pinned latencies on standard chat workloads.

A subtler benefit: chunked prefill also smooths the service-time distribution in queueing-theory terms. Without it, prefill service times have a long right tail (they grow with prompt length); with it, every iteration’s service time is bounded by max_num_batched_tokens, and the per-iteration GPU time becomes nearly constant. Kingman’s approximation for a single-server queue says mean waiting time grows with the squared coefficient of variation of service time, so bounding service-time variability cuts queueing delay, and with it p99 latency, even when the mean is unchanged.
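Kingman’s approximation for the mean wait in a G/G/1 queue is (ρ / (1 − ρ)) × ((c_a² + c_s²) / 2) × mean service time, where ρ is utilization and c_a², c_s² are the squared coefficients of variation of inter-arrival and service times. A batching GPU server is not literally a G/G/1 queue, so the sketch below is only meant to show the direction and rough size of the effect; the utilization, service time, and variability values are made-up inputs.

```python
def kingman_wait_ms(utilization: float, mean_service_ms: float, ca2: float, cs2: float) -> float:
    """Kingman's approximation for mean queueing delay in a G/G/1 queue.

    ca2 and cs2 are squared coefficients of variation (variance / mean^2)
    of inter-arrival times and service times respectively.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization) * (ca2 + cs2) / 2.0 * mean_service_ms


# Assumed inputs: 80% utilization, 50 ms mean service time, Poisson arrivals (ca2 = 1).
heavy_tail = kingman_wait_ms(0.8, 50.0, ca2=1.0, cs2=4.0)  # long prefills mixed in
bounded = kingman_wait_ms(0.8, 50.0, ca2=1.0, cs2=0.1)     # chunked, near-constant steps
print(f"heavy-tailed service: ~{heavy_tail:.0f} ms mean wait")  # ~500 ms
print(f"bounded service:      ~{bounded:.0f} ms mean wait")     # ~110 ms
```

The mean service time is identical in both calls; only the variability term changes, which is the effect chunking exploits.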

Mechanics: prefill/decode disaggregation

Even with chunked prefill, prefill and decode contend for the same GPU’s compute and memory. The next architectural step, introduced in the 2024 DistServe paper and Microsoft’s contemporaneous Splitwise, is to run prefill on one pool of GPUs and decode on a separate pool, with the KV cache shipped from prefill-GPU to decode-GPU between phases. DistServe reports up to 7.4× higher goodput (requests per second meeting both TTFT and TPOT SLOs) and 12.6× tighter SLO bounds at the same load compared to co-located serving, by eliminating the prefill/decode interference entirely. Splitwise extends this further by using different hardware classes for the two pools (compute-optimized GPUs such as H100 and B100 for prefill, memory-bandwidth-optimized GPUs such as A100 and GH200 for decode), squeezing more throughput-per-dollar from the heterogeneous hardware match. The Hao AI Lab retrospective from 18 months later reports that disaggregation has now become production infrastructure across NVIDIA Dynamo, llm-d, vLLM (experimental disaggregated prefill), SGLang, and Moonshot AI’s Mooncake; disaggregation is the production default for high-scale serving in 2026.
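Shipping the KV cache between pools has a cost that is worth estimating before committing to disaggregation. The sketch below assumes a Llama-70B-style shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16) and a 50 GB/s link between pools; both are illustrative assumptions, and real systems shard the cache across GPUs and overlap the transfer with compute.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of one token's K and V entries across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V


def kv_transfer_ms(prompt_tokens: int, bytes_per_token: int, link_gb_per_s: float) -> float:
    """Time to move a prompt's KV cache over a link, ignoring latency and overlap."""
    total_bytes = prompt_tokens * bytes_per_token
    return total_bytes / (link_gb_per_s * 1e9) * 1000.0


per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(per_token)  # 327680 bytes, i.e. 320 KiB per token

transfer = kv_transfer_ms(prompt_tokens=8192, bytes_per_token=per_token, link_gb_per_s=50.0)
print(f"{transfer:.1f} ms")  # 53.7 ms for an 8192-token prompt
```

If the estimated transfer time is a large fraction of your TTFT budget, the interconnect rather than the GPUs becomes the constraint on disaggregated serving.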

When does disaggregation pay off? When prefill-bound workloads dominate (very long context inputs, RAG over large retrieved chunks, document-summarization batch jobs) the prefill pool can scale independently of decode; when decode-bound workloads dominate (short prompts, long outputs, agent-style streaming) the decode pool scales independently. Co-located serving makes the wrong trade-off in both directions: prefill-bound traffic starves decode, decode-bound traffic leaves prefill compute idle. For symmetric chat traffic (mixed prompt and output lengths) chunked prefill on a co-located server is usually enough; for asymmetric or long-context-heavy workloads, disaggregation is the lever.

Code: measuring TTFT, TPOT, and end-to-end latency in Python

The code below measures TTFT, TPOT, and end-to-end latency against a local vLLM OpenAI-compatible server. Install: pip install openai. Start vLLM with vllm serve Qwen/Qwen3-4B-Instruct --max-num-batched-tokens 2048 (or your model and chunk size of choice). The same client also works against TGI or hosted endpoints with OPENAI_BASE_URL overridden; the wire format is the same.

python

# pip install openai
import os
import statistics
import time
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("OPENAI_BASE_URL", "http://localhost:8000/v1"),
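    # Local vLLM ignores the key; hosted endpoints read it from OPENAI_API_KEY.
    api_key=os.environ.get("OPENAI_API_KEY", "not-needed-for-local"),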
)


@dataclass
class LatencyTrace:
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float
    output_tokens: int


def measure_one(prompt: str, max_tokens: int = 256, model: str = "Qwen/Qwen3-4B-Instruct") -> LatencyTrace:
    """One streaming call, with per-token timing.

    TTFT is measured from request send to the first non-empty content delta.
    TPOT is the mean inter-token gap *after* the first token. e2e is wall time.
    """
    t0 = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
        # request usage in the final chunk for accurate token counts
        stream_options={"include_usage": True},
    )

    first_token_time = None
    token_times: list[float] = []
    completion_tokens = 0

    for event in stream:
        # Per-chunk events; last chunk carries `usage` and no content.
        if event.choices and event.choices[0].delta.content:
            now = time.perf_counter()
            if first_token_time is None:
                first_token_time = now
            else:
                token_times.append(now)
        if event.usage:
            completion_tokens = event.usage.completion_tokens

    t_end = time.perf_counter()
    if first_token_time is None:
        raise RuntimeError("no tokens emitted")

    ttft_ms = (first_token_time - t0) * 1000
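    # Mean gap between consecutive content chunks after the first one. A server
    # may pack several tokens into one chunk, so treat this as an approximation.
    if token_times:
        gaps = [
            (b - a) * 1000
            for a, b in zip([first_token_time] + token_times[:-1], token_times)
        ]
        tpot_ms = statistics.mean(gaps)
    else:
        tpot_ms = 0.0  # single-chunk response: no inter-token gap to measure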
    e2e_ms = (t_end - t0) * 1000

    return LatencyTrace(ttft_ms, tpot_ms, e2e_ms, completion_tokens)


def percentile(values: list[float], p: float) -> float:
    if not values:
        return float("nan")
    s = sorted(values)
    k = (len(s) - 1) * p / 100
    f = int(k)
    return s[f] + (k - f) * (s[min(f + 1, len(s) - 1)] - s[f])


def run_benchmark(prompts: list[str], n_each: int = 20) -> dict[str, float]:
    """Sequential probe to characterize an idle server; for load testing,
    drive concurrent workers (asyncio + a semaphore) and merge their traces."""
    traces: list[LatencyTrace] = []
    for prompt in prompts:
        for _ in range(n_each):
            traces.append(measure_one(prompt))

    ttfts = [t.ttft_ms for t in traces]
    tpots = [t.tpot_ms for t in traces]
    e2es = [t.e2e_ms for t in traces]
    return {
        "ttft_p50_ms": percentile(ttfts, 50),
        "ttft_p99_ms": percentile(ttfts, 99),
        "tpot_p50_ms": percentile(tpots, 50),
        "tpot_p99_ms": percentile(tpots, 99),
        "e2e_p50_ms": percentile(e2es, 50),
        "e2e_p99_ms": percentile(e2es, 99),
    }


if __name__ == "__main__":
    # Mixed prompt lengths to exercise the chunked-prefill scheduler.
    short_prompt = "Write a haiku about distributed systems."
    long_prompt = "Summarize the key points in this passage:\n\n" + ("Continuous batching is iteration-level scheduling. " * 200)
    results = run_benchmark([short_prompt, long_prompt], n_each=10)
    for k, v in results.items():
        print(f"{k}: {v:.1f}")

The stream_options={"include_usage": True} flag is how vLLM and OpenAI return accurate output token counts in streaming mode; without it, you’d have to tokenize the response yourself. The sequential probe in run_benchmark characterizes the idle server; to find the bimodal-latency regime described in the opener, you need a concurrent load generator that holds the server at ~70-90% utilization while a single long-prompt request lands. The latency under load is what tells you whether your chunked-prefill budget is set right. A good rule of thumb: run the same harness twice, once with --max-num-batched-tokens 2048 and once with --max-num-batched-tokens 16384; the gap between the two TTFT p99s and the gap between the two TPOT p99s reveal the trade-off frontier for your traffic.
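A concurrent load generator of the kind described above can be built from asyncio and a semaphore. The sketch below is a skeleton under stated assumptions: `run_load` and `fake_send` are hypothetical names, and `fake_send` only sleeps so that the example runs without a server. In practice you would replace it with an async version of measure_one built on the openai package’s async client.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_load(
    send: Callable[[str], Awaitable[float]],
    prompts: list[str],
    concurrency: int,
) -> list[float]:
    """Call send(prompt) for every prompt with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(prompt: str) -> float:
        async with sem:  # blocks while `concurrency` requests are already running
            return await send(prompt)

    # gather preserves input order, so results line up with prompts.
    return list(await asyncio.gather(*(worker(p) for p in prompts)))


async def fake_send(prompt: str) -> float:
    """Stand-in for a real streaming request; returns elapsed time in ms."""
    t0 = time.perf_counter()
    await asyncio.sleep(0.001 * len(prompt.split()))  # pretend longer prompts take longer
    return (time.perf_counter() - t0) * 1000


# Background chat-like traffic plus one long prompt landing in the middle of it.
prompts = ["short prompt"] * 8 + ["a much longer prompt " * 20] + ["short prompt"] * 8
latencies = asyncio.run(run_load(fake_send, prompts, concurrency=4))
print(len(latencies), f"max={max(latencies):.0f} ms")
```

Because results come back in prompt order, the latencies of the short requests that overlapped the long one can be compared directly against those that did not.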

Tuning the dials: what to set and why

The four most operationally important knobs on a vLLM-class server:

max_num_batched_tokens (vLLM) / --max-batch-prefill-tokens (TGI). The total token budget per scheduler iteration. Smaller (1024-2048) prioritizes TPOT; larger (8192-16384) prioritizes TTFT. Default in vLLM V1 is workload-tuned and varies by version; explicit setting is recommended for production deployments. Tune by running the latency harness above at two values and choosing the one that hits your SLO mix.

max_num_seqs (vLLM) / --max-concurrent-requests (TGI). The max number of in-flight sequences the scheduler will admit into a batch. Set this near the GPU’s KV-cache capacity at your typical context length: setting it too high causes preemption thrashing (sequences get evicted to make room and have to recompute prefill), while setting it too low underutilizes decode bandwidth. vLLM’s periodic stats log reports GPU KV cache usage; if it sits at 95-100% during peak load you’re undersized, and if it sits at <50% during peak you have headroom to raise concurrency.

Chunked prefill on/off. Default-on in vLLM V1 for most cases. Turn it off only if you’re running an exclusively prefill-dominated workload (e.g. batch document summarization) where there are no concurrent decodes to interleave with; in that case, the iteration overhead is pure cost.

Prefix caching on/off. vLLM V1’s automatic prefix caching is the open-weights analogue of Anthropic/OpenAI prompt caching: a content-addressed cache of K/V tensors for prompt prefixes, with <1% throughput overhead even at 0% hit rate (V1’s optimization). Leave it on. Then structure your prompts so prefixes are stable (system prompt and tool schemas at the top, retrieved context in the middle, user message at the bottom) to maximize cache hit rate.
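Why prompt ordering matters for a content-addressed prefix cache can be shown with a toy model. The sketch below hashes fixed-size token blocks, chaining each block’s hash to its parent so that a key identifies the whole prefix up to that block. It is an illustration of the idea only, not vLLM’s implementation; the block size, hash function, and function names are all assumptions.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cached K/V block (illustrative value)


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """One chained hash per full block; a trailing partial block is not cached."""
    hashes: list[str] = []
    parent = ""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(token_ids: list[int], cache: set[str], block_size: int = BLOCK_SIZE) -> int:
    """Number of leading tokens whose K/V blocks are already in the cache."""
    hit = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break  # the first miss ends the reusable prefix
        hit += block_size
    return hit


system = list(range(64))                       # stable system prompt, 4 blocks
first_request = system + list(range(1000, 1020))
cache = set(block_hashes(first_request))       # blocks left behind by the first request

stable_prefix = system + list(range(2000, 2020))   # same system prompt, new user message
shuffled = list(range(2000, 2020)) + system        # user message placed first

print(cached_prefix_tokens(stable_prefix, cache))  # 64
print(cached_prefix_tokens(shuffled, cache))       # 0
```

Putting the varying content first changes the very first block hash, so nothing downstream can be reused even though the system prompt tokens are identical.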

For disaggregated serving the vLLM disaggregated prefill docs cover the experimental V1 path; for production-scale deployments, NVIDIA Dynamo and SGLang’s PD disaggregation docs are the maturity leaders in mid-2026.