Inference Latency: Prefill, Decode, and Batching

Inside the inference server: prefill vs decode, continuous batching, chunked prefill, prefill/decode disaggregation, TTFT/TPOT — and the dials.

Jatin Bansal@blog:~/ai-engineering$ open inference-latency

A team migrates a chat product from a hosted API to a self-hosted vLLM cluster on H100s to cut cost. The dashboard the next morning is confusing: median latency is better than the hosted endpoint, but p99 has doubled, and the worst tail at peak load is 10× the median. On hosted infrastructure, latency was nearly flat across load; now it’s bimodal — a fast mode when the server is idle, a slow mode when a long prompt arrives in the same batch as in-flight short conversations. Same model, same hardware class, very different scheduling. The migration didn’t change the model; it changed who owns the batching policy. Every latency number a downstream system observes is a function of three things: how the inference server splits the work between prefill and decode, how it batches in-flight requests, and how it prevents long prompts from stalling everyone else’s tokens.

Opening bridge

The Evaluation subtree closed last week with human-feedback loops feeding production labels back into the eval suite. That subtree assumed the inference call was an opaque box behind a streaming API — token counts, latency numbers, dollar costs measured at the wire. This article opens the box. Everything the LLM inference fundamentals piece framed as the prefill/decode split — the O(n²) prefill that’s compute-bound, the memory-bandwidth-bound decode loop — turns into a scheduling problem the moment you serve more than one request at a time. The first article in the Production & Operations subtree walks through how modern inference servers actually solve that problem, and what the dials you’ll be tuning (max_num_batched_tokens, max_num_seqs, chunked prefill on/off, prefill/decode disaggregation) actually mean.

Definition

Inference latency for an LLM is the wall-clock time a request spends inside the serving stack, decomposed into three sequential phases (queueing, prefill, and decode) and measured by three metrics that capture different parts of the user experience: TTFT, TPOT, and end-to-end latency. Queueing time is how long the request sat waiting before the scheduler picked it up. Prefill time is the compute-bound forward pass over the input prompt that produces the KV cache and the first output token. Decode time is the memory-bandwidth-bound loop that auto-regressively generates each subsequent token. The metrics: TTFT (time-to-first-token) equals queueing plus prefill; this is the latency the user perceives until streaming starts. TPOT (time-per-output-token), sometimes called ITL (inter-token latency), is the average gap between consecutive decoded tokens; this is the latency the user perceives during streaming. End-to-end latency is approximately TTFT plus TPOT × output token count (strictly TPOT × (output tokens − 1), since the first token is already counted in TTFT); this is the latency a non-streaming caller sees. A 200-token answer at TTFT = 200 ms and TPOT = 30 ms has e2e ≈ 200 ms + 200 × 30 ms = 6,200 ms ≈ 6.2 s. TTFT dominates the perceived experience for streaming UIs and decode dominates the cost.
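
The decomposition can be written down as a small sketch. The function names and the 50 ms / 150 ms split of TTFT are illustrative assumptions, not measurements; the sketch uses the strict form in which the first token is covered by TTFT, so the 200-token example comes out at about 6.17 s rather than the rounded 6.2 s.

```python
def ttft_from_phases(queue_ms: float, prefill_ms: float) -> float:
    """TTFT is the time spent queueing plus the prefill forward pass."""
    return queue_ms + prefill_ms


def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency: the first token lands at TTFT, then each of the
    remaining (output_tokens - 1) tokens costs one TPOT gap."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_ms + tpot_ms * (output_tokens - 1)


# Hypothetical split of a 200 ms TTFT: 50 ms queueing + 150 ms prefill.
ttft = ttft_from_phases(queue_ms=50.0, prefill_ms=150.0)
e2e = e2e_latency_ms(ttft, tpot_ms=30.0, output_tokens=200)
print(f"TTFT = {ttft:.0f} ms, e2e = {e2e / 1000:.2f} s")
```

Keeping the phases as separate inputs makes it easy to see which term a given optimization moves: queueing and prefill only touch TTFT, while anything that changes TPOT is multiplied by the output length.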

The three phases respond to different optimizations. Queueing time falls when the scheduler can fit a new request into an in-flight batch, which is the whole point of continuous batching. Prefill’s interference with other requests falls when the server splits long prompts into chunks that coexist with decode steps (chunked prefill), and prefill time itself falls when a stable prefix is already cached on the GPU (prompt caching). Decode cost per token falls when each read of the model weights from HBM is shared by more sequences, which means more concurrent decodes in the same batch, which means more memory pressure on the KV cache, which is why PagedAttention exists. The dials interact, and a well-tuned server is one where TTFT and TPOT trade off cleanly against batch size and total throughput rather than collapsing into the bimodal worst case the team in the opener hit.

Intuition

The mental model: prefill and decode are two different workloads on the same GPU, and pretending they’re one job is what makes naive servers slow. Prefill is a matrix-matrix multiply (large activation × large weights) that saturates a modern GPU’s compute units; doubling the prompt length at least doubles the prefill FLOPs (the linear layers scale linearly and attention scales quadratically), and a long-enough prompt will completely occupy the GPU for hundreds of milliseconds. Decode is a matrix-vector multiply per token (a single new query against the KV cache); it consumes negligible compute but is bottlenecked on reading all of the model weights and KV state out of HBM at every step. Stack one prefill on top of a fleet of in-flight decodes and the decodes stall: the GPU is busy with prefill, and every decode request waiting for its next token sees TPOT spike. This is the classic head-of-line blocking that Sarathi-Serve’s chunked prefill was designed to eliminate.

The complementary frame: batching is the only way to amortize the fixed cost of a forward pass, and the batch you want at prefill is the opposite of the batch you want at decode. Prefill is compute-bound; the optimal batch size is small (one or two prompts) because the matrix multiply already saturates compute and adding more parallel work just queues. Decode is memory-bandwidth-bound; the optimal batch size is large (dozens of concurrent sequences) because every concurrent decode reads the same model weights from HBM and amortizes the bandwidth cost across many tokens. A server that runs both phases on the same GPU has to constantly switch between these two regimes, and the scheduling primitive that lets them coexist gracefully — continuous batching — is the single biggest unlock in modern LLM serving.
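
A rough roofline calculation makes the two regimes concrete. This is a back-of-envelope sketch, not a performance model: the ~1,000 TFLOP/s peak is an assumed round number for an H100-class GPU, the ~3 TB/s bandwidth is the figure used elsewhere in this article, and the model counts only weight traffic (it ignores KV-cache reads and attention FLOPs).

```python
# Assumed round hardware figures for an H100-class GPU.
PEAK_FLOPS = 1.0e15        # assumed ~1,000 TFLOP/s dense FP16
HBM_BYTES_PER_S = 3.0e12   # ~3 TB/s HBM bandwidth
BYTES_PER_PARAM = 2        # FP16 weights


def arithmetic_intensity(tokens_in_pass: int) -> float:
    """FLOPs per byte of weights read when `tokens_in_pass` tokens share one
    forward pass. Each parameter costs ~2 FLOPs per token and is read once
    per pass, so the parameter count cancels out."""
    return 2.0 * tokens_in_pass / BYTES_PER_PARAM


def ridge_point() -> float:
    """Intensity at which the GPU switches from memory-bound to compute-bound."""
    return PEAK_FLOPS / HBM_BYTES_PER_S


def regime(tokens_in_pass: int) -> str:
    if arithmetic_intensity(tokens_in_pass) >= ridge_point():
        return "compute-bound"
    return "memory-bound"


for tokens in (1, 16, 64, 512, 2048):
    print(f"{tokens:>5} tokens per pass -> {regime(tokens)}")
```

Under these assumptions a pass needs a few hundred tokens before compute becomes the limit, which is why a single prompt is already enough work for prefill while decode wants dozens of sequences sharing each weight read.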

The third frame, the queueing-theory parallel: a busy LLM server is an M/G/k queue whose service-time distribution is heavy-tailed because prefill time scales with prompt length. Standard queueing intuition — utilization above ~70% explodes the wait time — applies, but with an extra wrinkle: a single 50k-token prefill admitted into the batch can stall every short-request decode behind it for seconds. The tail of the latency distribution is dominated by the tail of the prompt-length distribution, not the request rate. Sizing capacity by mean QPS is what gives you the bimodal latency the team in the opener hit; sizing by 99th-percentile prompt length is what gives you predictable p99 TPOT.

The distributed-systems parallel

The closest analogue is a database query scheduler with read-write contention. Decode is the read workload — lots of short, cheap, latency-sensitive queries that benefit from parallel execution against a hot cache. Prefill is the write workload — a heavy, throughput-sensitive batch operation that, if scheduled on the same core as the reads, starves them. Database servers solved this decades ago with separate read and write paths, read replicas, write-ahead logging that lets reads proceed during writes, and admission control on long writes. The inference-serving equivalents map one-to-one: prefill/decode disaggregation onto separate GPUs is the read-replica pattern; chunked prefill is the long-write-broken-into-small-chunks-so-reads-can-proceed pattern; continuous batching is the iteration-level work-stealing scheduler that lets new requests jump into a running batch without restarting it. The 2022 Orca paper introduced iteration-level scheduling for LLMs by porting exactly this idea — schedule at the granularity of one forward pass, not one request — from systems work that had been standard in database servers for decades.

The deeper parallel runs all the way to queueing networks. A production inference cluster is a network of queues: an ingress queue at the load balancer, an admission queue at each replica, a scheduler queue inside the engine that admits requests into the running batch, a KV-cache memory pool that can preempt requests when full. Each queue has its own discipline (FCFS, priority, fairness), its own service-time distribution, and its own backpressure signal. The same Little’s-Law-and-Kingman-formula reasoning that applies to a microservice fleet applies here: utilization × variability is the master variable for queueing delay, and the lever you have is to reduce variability (chunked prefill turns a single long prefill from a heavy-tailed service time into a sequence of bounded ones).

A real disanalogy worth flagging. The CPU/GPU memory hierarchy makes the scheduling problem more constrained than a database server’s: every in-flight request holds GPU memory for its KV cache (megabytes to gigabytes for long contexts), and once that memory is full, admitting a new request requires evicting or paging out an existing one. PagedAttention’s blocked allocation made this tractable by treating the KV cache as virtual memory with page tables, but the capacity ceiling is real: you cannot just scale concurrency by throwing more threads at the problem. The OS-virtual-memory parallel is the right one here, not the unlimited-thread-pool one. The agent-harness piece framed the harness as the operating system for the agent; the inference engine is the operating system for the GPU.

Mechanics: prefill vs decode, what each phase does

The split is mechanical. A request arrives with prompt_tokens = [t₁, t₂, …, tₙ]. The server runs a single forward pass over all n tokens in parallel — every token attends to every prior token, producing the K/V tensors for each layer. This is prefill. At the end, the server samples the (n+1)th token from the model’s output distribution. Total work is O(n²) in the attention layers (every token attends to every other) and O(n) in the FFN layers; modern GPUs saturate compute around a prompt length of a few thousand tokens for a 70B-class model. TTFT is dominated by this single forward pass plus any queueing time before the scheduler admitted the request.

After prefill the request enters the decode loop. Each iteration: feed the previously sampled token through one forward pass, where the new token’s query attends to the entire stored K/V cache (no recompute needed), sample the next token from the output distribution, append it to the KV cache. Each step touches one query token against a growing K/V state; the FLOPs are negligible, but the memory traffic (reading the model weights and the full K/V cache from HBM on every step) is the bottleneck. A 70B-class model at FP16 reads ~140 GB of weights per step; at ~3 TB/s of HBM bandwidth (an H100-class figure; in practice 140 GB of weights is sharded across several GPUs, which also adds their bandwidth together), the theoretical floor for TPOT is ~47 ms per token for one sequence. The whole point of batching is to amortize that 140 GB read across many concurrent decodes: at batch size 16, one step still takes about as long, so each user’s TPOT stays near the floor, but the step now produces 16 tokens, and the amortized cost falls toward ~3 ms of GPU time per generated token. This is why a busy inference server with many concurrent users is cheaper per token than an idle one. (Quantization attacks the same equation from the other side: an INT4-quantized 70B model reads ~35 GB per step instead of 140 GB, dropping the single-sequence TPOT floor by 4×.)
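
The bandwidth arithmetic above fits in a few lines. This is a sketch under the same simplifying assumptions as the prose: a single device with ~3 TB/s of HBM bandwidth, weight reads only (KV-cache traffic ignored), and a step time that does not grow with batch size. The step time is a floor on what each user sees between tokens; dividing it by the batch size gives the amortized GPU time per generated token.

```python
def decode_step_floor_ms(weight_bytes: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: time to stream the weights out of HBM once."""
    return weight_bytes / hbm_bytes_per_s * 1000.0


def amortized_ms_per_token(step_ms: float, batch_size: int) -> float:
    """GPU time per generated token when `batch_size` sequences share one step."""
    return step_ms / batch_size


HBM_BYTES_PER_S = 3.0e12        # ~3 TB/s, H100-class figure
FP16_70B_BYTES = 70e9 * 2.0     # ~140 GB of weights
INT4_70B_BYTES = 70e9 * 0.5     # ~35 GB of weights

fp16_step = decode_step_floor_ms(FP16_70B_BYTES, HBM_BYTES_PER_S)
int4_step = decode_step_floor_ms(INT4_70B_BYTES, HBM_BYTES_PER_S)

print(f"FP16 step floor: {fp16_step:.1f} ms")
print(f"FP16 amortized at batch 16: {amortized_ms_per_token(fp16_step, 16):.1f} ms/token")
print(f"INT4 step floor: {int4_step:.1f} ms")
```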

A consequence that bites: the optimal batch shape changes between prefill and decode, and the scheduler has to manage both. A naive batch of “everyone’s prefill plus everyone’s decode” stalls because the prefill phase dominates GPU time and the decode tokens pile up waiting. The 2022 Orca paper called this the static-batching failure mode and observed that throughput could improve by an order of magnitude when batching happened at the iteration level — the scheduler picks which sequences participate in each forward pass independently, with prefills and decodes coexisting only when their compute profiles allow it.

Mechanics: continuous batching (the Orca primitive)

Iteration-level scheduling, the technique Orca introduced and every modern inference engine implements, runs one forward pass per scheduler decision rather than one request per scheduler decision. At the start of each iteration the scheduler picks a batch of sequences to run in this forward pass: each sequence has its current K/V state, its next-token-to-process (a prompt token for prefilling sequences, a previously-sampled token for decoding sequences), and a free slot in the batch tensor. The forward pass runs. Each sequence’s output is appended to its K/V state; decoding sequences also emit one sampled token. Then the scheduler picks the next iteration’s batch — and that batch can be different: a newly-arrived request that wants to start prefilling can join, a sequence that just hit EOS leaves, an in-flight long-output decode continues.

The vocabulary that grew up around this: continuous batching (the marketing name, popularized by Anyscale’s 2023 blog post on 23× throughput), iteration-level batching (Orca’s term), in-flight batching (NVIDIA’s term in TensorRT-LLM). They mean the same thing. The contrast is with static batching, where a batch is formed once at admission and runs to completion before the next batch starts; static batching wastes GPU cycles whenever any sequence in the batch finishes early, and it forces newly arrived requests to wait for the entire batch to drain before they can start prefilling.
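
The gap between static and continuous batching shows up even in a toy model. The sketch below is an illustration under strong assumptions (all requests arrive at time zero, prefill is ignored, every forward pass costs the same), not a reproduction of any engine's scheduler: it only counts how many forward passes each policy needs to drain the same queue.

```python
from collections import deque


def static_batching_iterations(output_lens: list[int], batch_size: int) -> int:
    """Batches are formed once, in arrival order, and each runs until its
    longest sequence finishes."""
    total = 0
    for start in range(0, len(output_lens), batch_size):
        total += max(output_lens[start:start + batch_size])
    return total


def continuous_batching_iterations(output_lens: list[int], batch_size: int) -> int:
    """Before every forward pass, free slots are refilled from the waiting
    queue; finished sequences leave immediately."""
    waiting = deque(output_lens)
    running: list[int] = []  # tokens still to generate, per running sequence
    iterations = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass: every running sequence emits one token.
        running = [left - 1 for left in running if left - 1 > 0]
        iterations += 1
    return iterations


# Two long generations mixed with six short ones, four batch slots.
lens = [100, 5, 5, 5, 100, 5, 5, 5]
print("static:    ", static_batching_iterations(lens, batch_size=4))
print("continuous:", continuous_batching_iterations(lens, batch_size=4))
```

In the static case the three short sequences in each batch finish early and their slots sit idle until the long one completes; in the continuous case the second long generation starts as soon as the first short sequences leave.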

What continuous batching doesn’t solve on its own is the head-of-line blocking problem when a long prefill enters the batch. A single 50k-token prefill takes ~500ms-2s of GPU time (model- and hardware-dependent); during that forward pass, every in-flight decode sees its next token delayed by roughly that amount. The inter-token gap for users mid-conversation spikes from 30ms to 1000ms+. This is the bimodal latency mode the team in the opener was hitting. The fix is the next primitive.

Mechanics: chunked prefill (the Sarathi primitive)

Chunked prefill splits a long prompt’s prefill into many smaller forward passes — chunks of say 512-2048 tokens — and piggybacks decode steps into the same forward pass as each prefill chunk. The technique came from the 2023 SARATHI paper and was extended into a production scheduler in Sarathi-Serve at OSDI ‘24. Each scheduled iteration carries a budget of max_num_batched_tokens — say 2048 — and the scheduler fills that budget greedily: in-flight decodes (one token each), then as many new prefill chunks as fit. A 50k-token prefill becomes 25 iterations of 2048 tokens each, and during all 25 iterations the in-flight decodes are still making progress — one token per iteration, no stalling.
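
The budget-filling rule can be sketched as follows. This is a toy planner, not vLLM's or Sarathi-Serve's scheduler: the function names are hypothetical, there is a single pending prefill, and the number of in-flight decodes is assumed constant for the duration of the prefill.

```python
def plan_iteration(num_decodes: int, prefill_remaining: int, token_budget: int) -> tuple[int, int]:
    """Return (decode_tokens, prefill_chunk_tokens) for one forward pass.
    Decodes go first (one token each); leftover budget goes to the prefill chunk."""
    decode_tokens = min(num_decodes, token_budget)
    prefill_chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_chunk


def iterations_to_prefill(prompt_tokens: int, num_decodes: int, token_budget: int) -> int:
    """How many forward passes a prompt needs when decodes ride along in each pass."""
    if num_decodes >= token_budget:
        raise ValueError("budget is fully consumed by decodes; the prefill would starve")
    remaining = prompt_tokens
    iterations = 0
    while remaining > 0:
        _, chunk = plan_iteration(num_decodes, remaining, token_budget)
        remaining -= chunk
        iterations += 1
    return iterations


for budget in (512, 2048, 8192):
    n = iterations_to_prefill(prompt_tokens=50_000, num_decodes=32, token_budget=budget)
    print(f"budget={budget:>5}: {n} iterations, each also advancing 32 decodes by one token")
```

Scheduling decodes first is what guarantees that every in-flight conversation gets a token in every iteration; the prefill only ever consumes what is left of the budget.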

The trade-off is direct. Smaller max_num_batched_tokens (e.g. 2048) gives better TPOT — fewer prefill tokens per iteration means decodes don’t get drowned out, and inter-token latency stays low. Larger max_num_batched_tokens (e.g. 8192–16384) gives better TTFT — more prefill tokens per iteration means the prefilling sequence finishes its first token sooner. The vLLM docs frame this as: small for ITL-sensitive workloads (chat, coding assistants), large for TTFT-sensitive workloads (single-question UX, batch document processing). The default in vLLM V1 enables chunked prefill automatically; the Hugging Face TGI --max-batch-prefill-tokens and --waiting-served-ratio knobs control the same trade-off with different names. Anthropic’s June 2024 piggyback batching post (joint with Databricks Mosaic) reports 2-3× throughput improvements over un-chunked vLLM at SLO-pinned latencies on standard chat workloads.

A subtler benefit: chunked prefill also smooths the service-time distribution in queueing-theory terms. Without it, prefill service times have a long right tail (growing with prompt length); with it, every iteration’s service time is bounded by max_num_batched_tokens, and the per-iteration GPU time becomes nearly constant. Kingman’s formula says mean waiting time scales with the squared coefficient of variation of the service time (together with that of the inter-arrival time) and with ρ/(1−ρ) in the utilization ρ; bounding service-time variability shaves p99 latency even when the mean is unchanged.
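
Kingman's approximation is short enough to evaluate directly. The sketch below uses the standard single-server (G/G/1) form for mean queueing delay; it is meant to show the direction and rough size of the effect, not to model an inference engine, and the utilization, service time, and variability values are made-up illustrations.

```python
def kingman_wait(mean_service: float, utilization: float, ca2: float, cs2: float) -> float:
    """Kingman's G/G/1 approximation for mean time spent waiting in queue:
        W_q ~= (rho / (1 - rho)) * ((ca2 + cs2) / 2) * mean_service
    where ca2 and cs2 are the squared coefficients of variation of the
    inter-arrival and service times."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization) * (ca2 + cs2) / 2.0 * mean_service


MEAN_SERVICE_MS = 20.0  # illustrative mean per-iteration service time
RHO = 0.8               # illustrative utilization

# Same mean and same load; only the service-time variability changes.
for label, cs2 in (("heavy-tailed (unchunked)", 9.0), ("bounded (chunked)", 0.1)):
    wait = kingman_wait(MEAN_SERVICE_MS, RHO, ca2=1.0, cs2=cs2)
    print(f"{label:<26} mean wait ~ {wait:.0f} ms")
```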

Mechanics: prefill/decode disaggregation

Even with chunked prefill, prefill and decode contend for the same GPU’s compute and memory. The next architectural step, introduced in the 2024 DistServe paper and Microsoft’s contemporaneous Splitwise, is to run prefill on one pool of GPUs and decode on a separate pool, with the KV cache shipped from prefill-GPU to decode-GPU between phases. DistServe reports serving up to 7.4× more requests within SLO (goodput: requests per second meeting both TTFT and TPOT targets) or meeting up to 12.6× tighter SLOs compared to co-located serving, by eliminating the prefill/decode interference entirely. Splitwise extends this further by using different hardware classes for the two pools (compute-optimized GPUs such as H100 or B100 for prefill, memory-bandwidth-optimized GPUs such as A100 or GH200 for decode), squeezing more throughput-per-dollar from the heterogeneous hardware match. The Hao AI Lab retrospective from 18 months later reports that disaggregation has now become production infrastructure across NVIDIA Dynamo, llm-d, vLLM (experimental disaggregated prefill), SGLang, and Moonshot AI’s Mooncake; disaggregation is the production default for high-scale serving in 2026.
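
Goodput, as used above, is easy to compute from per-request traces. The sketch below is one plausible way to do it; the `RequestTrace` type, the SLO thresholds, and the sample numbers are hypothetical, and real evaluations usually add an attainment target (for example, 90% of requests within SLO) on top.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    ttft_ms: float
    tpot_ms: float


def goodput_rps(traces: list[RequestTrace], window_s: float,
                ttft_slo_ms: float, tpot_slo_ms: float) -> float:
    """Requests per second that met BOTH the TTFT and the TPOT target."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    ok = sum(1 for t in traces if t.ttft_ms <= ttft_slo_ms and t.tpot_ms <= tpot_slo_ms)
    return ok / window_s


# Four requests observed over a 2-second window (made-up numbers).
sample = [
    RequestTrace(ttft_ms=180, tpot_ms=28),   # meets both
    RequestTrace(ttft_ms=950, tpot_ms=25),   # misses TTFT
    RequestTrace(ttft_ms=210, tpot_ms=140),  # misses TPOT
    RequestTrace(ttft_ms=300, tpot_ms=40),   # meets both
]
print(goodput_rps(sample, window_s=2.0, ttft_slo_ms=500, tpot_slo_ms=50))
```

Raw throughput would count all four requests; goodput counts only the two that a user would have experienced as acceptable, which is why it is the metric that exposes prefill/decode interference.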

When does disaggregation pay off? When prefill-bound workloads dominate (very long context inputs, RAG over large retrieved chunks, document-summarization batch jobs) the prefill pool can scale independently of decode; when decode-bound workloads dominate (short prompts, long outputs, agent-style streaming) the decode pool scales independently. Co-located serving makes the wrong trade-off in both directions: prefill-bound traffic starves decode, decode-bound traffic leaves prefill compute idle. For symmetric chat traffic (mixed prompt and output lengths) chunked prefill on a co-located server is usually enough; for asymmetric or long-context-heavy workloads, disaggregation is the lever.

Code: measuring TTFT, TPOT, and end-to-end latency in Python

The code below measures TTFT, TPOT, and end-to-end latency against a local vLLM OpenAI-compatible server. Install: pip install openai. Start vLLM with vllm serve Qwen/Qwen3-4B-Instruct --max-num-batched-tokens 2048 (or your model and chunk size of choice). The same client also works against TGI or hosted endpoints with OPENAI_BASE_URL overridden — the wire format is the same.

python
# pip install openai
import os
import statistics
import time
from dataclasses import dataclass

from openai import OpenAI

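client = OpenAI(
    base_url=os.environ.get("OPENAI_BASE_URL", "http://localhost:8000/v1"),
    # A local vLLM server ignores the key; hosted endpoints need a real one
    # in OPENAI_API_KEY.
    api_key=os.environ.get("OPENAI_API_KEY", "not-needed-for-local"),
)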


@dataclass
class LatencyTrace:
    ttft_ms: float
    tpot_ms: float
    e2e_ms: float
    output_tokens: int


def measure_one(prompt: str, max_tokens: int = 256, model: str = "Qwen/Qwen3-4B-Instruct") -> LatencyTrace:
    """One streaming call, with per-token timing.

    TTFT is measured from request send to the first non-empty content delta.
    TPOT is the mean inter-token gap *after* the first token. e2e is wall time.
    """
    t0 = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
        # request usage in the final chunk for accurate token counts
        stream_options={"include_usage": True},
    )

    first_token_time = None
    token_times: list[float] = []
    completion_tokens = 0

    for event in stream:
        # Per-chunk events; last chunk carries `usage` and no content.
        if event.choices and event.choices[0].delta.content:
            now = time.perf_counter()
            if first_token_time is None:
                first_token_time = now
            else:
                token_times.append(now)
        if event.usage:
            completion_tokens = event.usage.completion_tokens

    t_end = time.perf_counter()
    if first_token_time is None:
        raise RuntimeError("no tokens emitted")

    ttft_ms = (first_token_time - t0) * 1000
    # mean gap between content deltas after the first; 0.0 when only one delta arrived
    if len(token_times) >= 1:
        gaps = [
            (b - a) * 1000
            for a, b in zip([first_token_time] + token_times[:-1], token_times)
        ]
        tpot_ms = statistics.mean(gaps)
    else:
        tpot_ms = 0.0
    e2e_ms = (t_end - t0) * 1000

    return LatencyTrace(ttft_ms, tpot_ms, e2e_ms, completion_tokens)


def percentile(values: list[float], p: float) -> float:
    if not values:
        return float("nan")
    s = sorted(values)
    k = (len(s) - 1) * p / 100
    f = int(k)
    return s[f] + (k - f) * (s[min(f + 1, len(s) - 1)] - s[f])


def run_benchmark(prompts: list[str], n_each: int = 20) -> dict[str, float]:
    """Sequential probe to characterize an idle server; for load testing,
    drive concurrent workers (asyncio + a semaphore) and merge their traces."""
    traces: list[LatencyTrace] = []
    for prompt in prompts:
        for _ in range(n_each):
            traces.append(measure_one(prompt))

    ttfts = [t.ttft_ms for t in traces]
    tpots = [t.tpot_ms for t in traces]
    e2es = [t.e2e_ms for t in traces]
    return {
        "ttft_p50_ms": percentile(ttfts, 50),
        "ttft_p99_ms": percentile(ttfts, 99),
        "tpot_p50_ms": percentile(tpots, 50),
        "tpot_p99_ms": percentile(tpots, 99),
        "e2e_p50_ms": percentile(e2es, 50),
        "e2e_p99_ms": percentile(e2es, 99),
    }


if __name__ == "__main__":
    # Mixed prompt lengths to exercise the chunked-prefill scheduler.
    short_prompt = "Write a haiku about distributed systems."
    long_prompt = "Summarize the key points in this passage:\n\n" + ("Continuous batching is iteration-level scheduling. " * 200)
    results = run_benchmark([short_prompt, long_prompt], n_each=10)
    for k, v in results.items():
        print(f"{k}: {v:.1f}")

Two things to flag. First, the stream_options={"include_usage": True} flag is how vLLM and OpenAI return accurate output token counts in streaming mode — without it, you’d have to tokenize the response yourself. Second, the sequential probe in run_benchmark characterizes the idle server; to find the bimodal-latency regime the team in the opener hit, you need a concurrent load generator that holds the server at ~70-90% utilization while a single long-prompt request lands. The latency under load is what tells you whether your chunked-prefill budget is set right. A good rule of thumb: run the same harness twice, once with --max-num-batched-tokens 2048 and once with --max-num-batched-tokens 16384; the gap between the two TTFT p99s and the gap between the two TPOT p99s reveal the trade-off frontier for your traffic.
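
The asyncio-plus-semaphore driver mentioned in the `run_benchmark` docstring could look like the sketch below. It is deliberately decoupled from the network: `run_concurrent` takes any async callable, and `_fake_measure` is a hypothetical stand-in that sleeps instead of calling a server. To drive a real endpoint you would pass an async variant of `measure_one` (for example, one built on the `openai` package's `AsyncOpenAI` client), which is not shown here.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_concurrent(
    measure: Callable[[str], Awaitable[T]],
    prompts: list[str],
    concurrency: int,
) -> list[T]:
    """Run measure(prompt) for every prompt with at most `concurrency`
    calls in flight. Results come back in the same order as `prompts`."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(prompt: str) -> T:
        async with sem:  # blocks here while `concurrency` calls are active
            return await measure(prompt)

    return await asyncio.gather(*(worker(p) for p in prompts))


async def _fake_measure(prompt: str) -> float:
    """Stand-in for a real streaming call: sleeps, then returns a fake latency."""
    await asyncio.sleep(0.01)
    return float(len(prompt))


if __name__ == "__main__":
    mixed = ["short prompt", "a much longer prompt " * 50] * 4
    print(asyncio.run(run_concurrent(_fake_measure, mixed, concurrency=4)))
```

A semaphore holds the number of in-flight requests constant, which is the simplest way to pin the server at a chosen concurrency; an open-loop generator that fires requests at a fixed arrival rate is the alternative when you want to study queueing rather than steady-state batching.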

Code: a TypeScript load generator with concurrent streams

The harness below holds the server at a configurable concurrency level, drives a mix of short and long prompts, and emits per-request TTFT/TPOT/e2e traces ready for percentile aggregation. Install: npm install openai.

typescript
// npm install openai
import OpenAI from "openai";

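const client = new OpenAI({
  baseURL: process.env.OPENAI_BASE_URL ?? "http://localhost:8000/v1",
  // A local vLLM server ignores the key; hosted endpoints need a real one.
  apiKey: process.env.OPENAI_API_KEY ?? "not-needed-for-local",
});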

interface LatencyTrace {
  ttftMs: number;
  tpotMs: number;
  e2eMs: number;
  outputTokens: number;
  promptKind: "short" | "long";
}

async function measureOne(
  prompt: string,
  promptKind: "short" | "long",
  model = "Qwen/Qwen3-4B-Instruct",
  maxTokens = 256,
): Promise<LatencyTrace> {
  const t0 = performance.now();
  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    max_tokens: maxTokens,
    stream: true,
    stream_options: { include_usage: true },
  });

  let firstTokenTime: number | null = null;
  const tokenTimes: number[] = [];
  let outputTokens = 0;

  for await (const event of stream) {
    const delta = event.choices?.[0]?.delta?.content;
    if (delta) {
      const now = performance.now();
      if (firstTokenTime === null) firstTokenTime = now;
      else tokenTimes.push(now);
    }
    if (event.usage) outputTokens = event.usage.completion_tokens;
  }

  const tEnd = performance.now();
  if (firstTokenTime === null) throw new Error("no tokens emitted");

  const ttftMs = firstTokenTime - t0;
  let tpotMs = 0;
  if (tokenTimes.length >= 1) {
    const prev = [firstTokenTime, ...tokenTimes.slice(0, -1)];
    const gaps = tokenTimes.map((t, i) => t - prev[i]);
    tpotMs = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  }
  const e2eMs = tEnd - t0;

  return { ttftMs, tpotMs, e2eMs, outputTokens, promptKind };
}

async function runConcurrent(
  prompts: Array<{ prompt: string; kind: "short" | "long" }>,
  concurrency: number,
  totalRequests: number,
): Promise<LatencyTrace[]> {
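  const traces: LatencyTrace[] = [];
  let inflight = 0;
  let submitted = 0;

  return new Promise((resolve) => {
    // Nothing to submit: resolve immediately instead of hanging forever.
    if (totalRequests <= 0) {
      resolve(traces);
      return;
    }
    const drain = () => {
      // Top up to the concurrency cap; every completion calls drain() again.
      while (inflight < concurrency && submitted < totalRequests) {
        const pick = prompts[submitted % prompts.length];
        submitted += 1;
        inflight += 1;
        measureOne(pick.prompt, pick.kind)
          .then((t) => {
            traces.push(t);
          })
          // Log failures instead of dropping them silently; failed requests
          // are excluded from the percentiles.
          .catch((err) => console.error("request failed:", err))
          .finally(() => {
            inflight -= 1;
            if (submitted >= totalRequests && inflight === 0) resolve(traces);
            else drain();
          });
      }
    };
    drain();
  });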
}

function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const s = [...values].sort((a, b) => a - b);
  const k = ((s.length - 1) * p) / 100;
  const f = Math.floor(k);
  return s[f] + (k - f) * (s[Math.min(f + 1, s.length - 1)] - s[f]);
}

async function main() {
  const longBody = "Continuous batching is iteration-level scheduling. ".repeat(200);
  const prompts = [
    { prompt: "Write a haiku about distributed systems.", kind: "short" as const },
    { prompt: `Summarize this passage:\n\n${longBody}`, kind: "long" as const },
  ];

  const traces = await runConcurrent(prompts, /*concurrency=*/ 16, /*total=*/ 200);

  for (const kind of ["short", "long"] as const) {
    const subset = traces.filter((t) => t.promptKind === kind);
    const ttfts = subset.map((t) => t.ttftMs);
    const tpots = subset.map((t) => t.tpotMs);
    console.log(`-- ${kind} prompts (n=${subset.length}) --`);
    console.log(`  ttft  p50=${percentile(ttfts, 50).toFixed(0)}ms  p99=${percentile(ttfts, 99).toFixed(0)}ms`);
    console.log(`  tpot  p50=${percentile(tpots, 50).toFixed(0)}ms  p99=${percentile(tpots, 99).toFixed(0)}ms`);
  }
}

main().catch(console.error);

Three operational notes. First, the harness deliberately mixes short and long prompts at the same time — that’s the configuration that exposes head-of-line blocking when chunked prefill is misconfigured. Running short-only or long-only benchmarks gives you a misleadingly flat latency distribution. Second, the concurrency level (16 in the example) should be set near the server’s max_num_seqs to actually exercise the batching scheduler; below that you’re measuring the idle-server case and won’t see the trade-offs. Third, the tpotMs measurement uses inter-token gaps rather than total decode time divided by token count, because the streaming channel can buffer multiple tokens per delta — the gap between adjacent content deltas is closer to what the user actually sees.

Tuning the dials: what to set and why

The four most operationally important knobs on a vLLM-class server:

max_num_batched_tokens (vLLM) / --max-batch-prefill-tokens (TGI). The total token budget per scheduler iteration. Smaller (1024-2048) prioritizes TPOT; larger (8192-16384) prioritizes TTFT. Default in vLLM V1 is workload-tuned and varies by version; explicit setting is recommended for production deployments. Tune by running the latency harness above at two values and choosing the one that hits your SLO mix.

max_num_seqs (vLLM) / --max-concurrent-requests (TGI). The max number of in-flight sequences the scheduler will admit into a batch. Set this near the GPU’s KV-cache capacity at your typical context length; setting it too high causes preemption thrashing (sequences get evicted to make room and have to recompute prefill), setting it too low underutilizes decode bandwidth. vLLM’s periodic stats log lines report KV-cache utilization: if it sits at 95-100% during peak load you’re undersized; if it sits at <50% during peak you have headroom to raise concurrency.
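
A first estimate for max_num_seqs can come from KV-cache arithmetic. The sketch below assumes a 70B-class model shape with grouped-query attention (80 layers, 8 KV heads, head dimension 128, FP16 cache) and a 40 GiB KV pool left over after weights and activations; all of these are illustrative assumptions to replace with your model's config and your engine's reported pool size.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_seqs(kv_pool_bytes: int, tokens_per_seq: int, per_token_bytes: int) -> int:
    """How many sequences of `tokens_per_seq` tokens fit in the KV pool."""
    return kv_pool_bytes // (tokens_per_seq * per_token_bytes)


# Assumed 70B-class shape with grouped-query attention.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
pool = 40 * 2**30  # assumed 40 GiB available for the KV cache

for context in (2048, 4096, 32768):
    n = max_concurrent_seqs(pool, context, per_token)
    print(f"{context:>6}-token sequences: ~{n} fit in the pool")
```

The estimate is an upper bound: it assumes every sequence sits exactly at the typical length, so leaving headroom below it is what keeps the preemption counter at zero.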

Chunked prefill on/off. Default-on in vLLM V1 for most cases. Turn it off only if you’re running an exclusively prefill-dominated workload (e.g. batch document summarization) where there are no concurrent decodes to interleave with — in that case, the iteration overhead is pure cost.

Prefix caching on/off. vLLM V1’s automatic prefix caching is the open-weights analogue of Anthropic/OpenAI prompt caching: a content-addressed cache of K/V tensors for prompt prefixes, with <1% throughput overhead even at 0% hit rate (V1’s optimization). Leave it on. Then structure your prompts so prefixes are stable — system prompt and tool schemas at the top, retrieved context in the middle, user message at the bottom — to maximize cache hit rate.
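
Why prompt ordering matters for a content-addressed prefix cache can be seen in a toy model. The sketch below is not vLLM's implementation: it assumes a block size of 16 tokens, uses plain integers as token IDs, and keeps the cache as a set of hashes. The one property it shares with real prefix caches is that each block's hash depends on everything before it, so a hit requires the whole prefix to match.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block (assumed)


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block chained with the hash of the preceding blocks."""
    hashes: list[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(cache: set[str], token_ids: list[int],
                         block_size: int = BLOCK_SIZE) -> int:
    """Number of leading tokens whose K/V could be reused from the cache."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        hits += 1
    return hits * block_size


system_prompt = list(range(64))             # stable prefix: 4 full blocks
first_user = list(range(1000, 1020))
second_user = list(range(2000, 2020))

cache = set(block_hashes(system_prompt + first_user))
print(cached_prefix_tokens(cache, system_prompt + second_user))  # stable prefix first
print(cached_prefix_tokens(cache, second_user + system_prompt))  # user text first
```

With the system prompt at the top, the second request reuses all of its blocks; with the user message at the top, the very first block differs and nothing downstream can hit.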

For disaggregated serving the vLLM disaggregated prefill docs cover the experimental V1 path; for production-scale deployments, NVIDIA Dynamo and SGLang’s PD disaggregation docs are the maturity leaders in mid-2026.

Trade-offs, failure modes, gotchas

Preemption thrashing. When KV-cache memory fills up, the scheduler evicts in-flight sequences to make room. Which sequences go first is policy-dependent; under vLLM’s default first-come-first-served policy it is the most recently admitted ones. The evicted sequence’s K/V state is dropped and the request has to recompute prefill when scheduled again. Under sustained pressure this becomes a thrashing loop: requests are admitted, partially decoded, evicted, re-prefilled, evicted again, with no one making meaningful progress. The signal in vLLM logs is the preemption counter; if it’s nonzero at steady state you need to either lower max_num_seqs (refuse work earlier) or scale out (more replicas).

Chunked prefill is not free for very small prompts. Each iteration has a fixed scheduling and kernel-launch overhead. For prompts that fit in one chunk anyway, chunked prefill just adds the overhead with no benefit; for prompts in the 500-2000 token range, the per-iteration cost can be meaningful relative to the work. In practice this is dominated by the gain on long prompts, but it’s why some workloads see slight regressions when chunked prefill is enabled at default settings — tune the chunk size up if your traffic is short-prompt-heavy.

Disaggregation has a KV transfer cost. Shipping the prefill K/V tensors from the prefill GPU to the decode GPU consumes interconnect bandwidth and adds latency to the prefill-to-decode handoff. On NVLink-connected H100 pairs the transfer is fast (on the order of a millisecond or a few for typical K/V sizes, which run to hundreds of megabytes for a 70B-class model at a few thousand tokens); across PCIe or InfiniBand the cost is much higher and can wipe out the gain from disaggregation. DistServe’s original paper covers the math; the practical rule is that disaggregation pays off when prefill and decode are imbalanced enough that the transfer cost is amortized over many decode steps.
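
The transfer cost is simple to estimate. The sketch below assumes a 70B-class model shape with grouped-query attention (80 layers, 8 KV heads, head dimension 128, FP16 cache) and two round-number link speeds: ~450 GB/s for one direction of an NVLink-class link and ~50 GB/s for a 400 Gb/s network-class link. These are assumed peak figures that ignore protocol overhead, so real transfers will be slower.

```python
def kv_cache_bytes(prompt_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Total KV-cache size for a prompt: K and V per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * prompt_tokens


def transfer_ms(num_bytes: int, link_bytes_per_s: float) -> float:
    """Time to move `num_bytes` over a link at its peak rate."""
    return num_bytes / link_bytes_per_s * 1000.0


NVLINK_BYTES_PER_S = 450e9   # assumed NVLink-class, one direction
NETWORK_BYTES_PER_S = 50e9   # assumed 400 Gb/s network-class link

for tokens in (2048, 50_000):
    size = kv_cache_bytes(tokens, num_layers=80, num_kv_heads=8, head_dim=128)
    fast = transfer_ms(size, NVLINK_BYTES_PER_S)
    slow = transfer_ms(size, NETWORK_BYTES_PER_S)
    print(f"{tokens:>6} tokens: {size / 1e6:,.0f} MB, ~{fast:.1f} ms fast link, ~{slow:.1f} ms slow link")
```

Comparing the result with the decode step time is the useful check: a handoff that costs about one decode step is lost in the noise, while one that costs dozens of steps shows up directly in the second-token latency.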

Speculative decoding interacts with batching in subtle ways. Speculative decoding uses a small draft model (or extra prediction heads on the target) to propose multiple tokens that the large model verifies in parallel — when it works, decode TPOT effectively halves. But speculative tokens occupy KV-cache memory whether they’re accepted or rejected, and the batch scheduler has to reserve room for the worst case. On a server already running near capacity, enabling speculation can paradoxically reduce throughput because it eats memory budget that would otherwise serve more concurrent decodes. The speculative decoding article walks through when speculation pays off, the acceptance-rate math, and the EAGLE/Medusa/ngram drafter landscape.

The latency you measure in isolation is not the latency you’ll see in production. A single-stream benchmark measures the idle-server latency. Production latency under realistic traffic is dominated by queueing under utilization, which is dominated by service-time variance, which is dominated by prompt-length tail. Always benchmark with a load generator that drives the server into its saturated regime — that’s the regime your SLO is going to live in.

Hosted endpoints are still continuously batching too. Self-hosting changes who owns the dials, not whether batching exists. Anthropic, OpenAI, Google, and Mistral all run the same kinds of inference engines under their APIs; their SLAs effectively encode their own choice of the TTFT/TPOT trade-off frontier. If you observe a latency spike on a hosted endpoint, what you’re seeing is the same scheduling math playing out at the provider’s scale — which is why “the API got slow today” shows up as a feature of the trace store, not a model regression.

Further reading from the field

  • Speculative Decoding and Draft Models — the direct sequel in the Production & Operations subtree. Once you’ve squeezed every drop out of prefill/decode scheduling, speculation is the next throughput lever — and the one with the most contention with the batching policies laid out here.
  • Cost Optimization and Model Routing — the application-layer cost lever that sits on top of the server-side optimizations covered here. Once each call is as cheap as the engine can make it, the next dollar comes from making fewer expensive calls.
  • Quantization and Distillation: Compression for Inference — the memory-pressure lever that stacks on top of batching. INT4/FP8 quantization cuts the per-step weight read 2–4×, which is the largest single TPOT improvement on top of continuous batching for memory-bound decode.
  • Prompt Caching: Reusing the KV Cache Across Calls — the cross-call extension of in-engine prefix caching. Cached prefill makes the prefill phase nearly free; that’s the highest-ROI dial when your prompts have stable prefixes.