
Quantization and Distillation: Compression for Inference

How big models get cheap: the math behind INT8/INT4/FP8 quantization, GPTQ/AWQ/SmoothQuant, soft-target distillation, and the 2026 production stack.

Jatin Bansal@blog:~/ai-engineering$ open quantization-distillation

A 70B-parameter model in 16-bit precision needs 140 GB for the weights alone, before activations and KV cache: out of reach of a single GPU, and uncomfortable even on a small node. The same model in INT4 fits in 35 GB and runs on a single 48 GB workstation-class card. The model that ships those production tokens is not the model that came out of the post-training pipeline; it’s a compressed artifact derived from it. Two compression techniques sit at the heart of every production inference stack: quantization shrinks the precision of each weight (and sometimes each activation), and distillation shrinks the count of weights by training a smaller student to imitate a larger teacher. Both are lossy. The engineering question is not whether you lose quality (you do) but where the cliff is and how to stay on the safe side of it.
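The arithmetic behind those figures is easy to check. A minimal sketch, counting weight storage only (no activations, KV cache, or quantization scales) and using decimal gigabytes; the helper name is illustrative:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal GB (1 GB = 1e9 bytes), ignoring scales and activations."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP16/BF16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"{name:10s} {weight_memory_gb(70e9, bits):6.1f} GB")
```

Real checkpoints come out slightly larger than the 4-bit figure because per-group scales and any layers left in 16-bit add overhead.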

Opening bridge

Yesterday’s piece on LoRA and PEFT walked through QLoRA’s NF4 4-bit quantization as a training-time memory trick: the frozen base model sits in 4-bit so the optimizer state and activation memory fit on a single GPU. Today’s piece zooms back out to the more general technique. Quantization at inference time is a much larger surface than QLoRA’s training-time application — it’s how the deployed model gets cheap to serve, not just cheap to fine-tune. And distillation closes the loop on the Fine-Tuning vs RAG decision tree: when fine-tuning a large model is the right answer but serving it isn’t, distill the behavior into a smaller model and serve that. Together they are the production-side counterparts to the training-side compression of LoRA — same family of “trade precision for cost” tricks, applied to inference rather than fine-tuning.

This is also the last article in the curriculum. The training-side fundamentals subtree that started with pre-training and RLHF closes here, and the loop back to inference is intentional: every compression decision below is a decision about how the artifacts from the training pipeline get deployed.

Definitions

Quantization is the process of representing a tensor with fewer bits per element. A 16-bit weight occupies 2 bytes; an INT8 weight occupies 1 byte; an INT4 weight occupies half a byte. The mapping from the original floating-point value to the quantized integer is a function q = round(x / s) + z, where s is the scale (the floating-point span each integer step covers) and z is the zero point (the integer that represents 0.0 in the original space). Dequantization at inference time recovers an approximation: x' = s * (q - z). The two free parameters, s and z, are chosen per tensor (or per channel, or per group of channels) to minimize the round-trip error, and the choice of granularity is the first major axis of the design space.
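A minimal sketch of the quantize/dequantize round trip in plain Python. It assumes the common convention in which the zero point is an integer added after scaling, and it stretches the range to include 0.0 so that zero is represented exactly. The function names are illustrative, not from any library:

```python
def affine_params(values, bits=8):
    """Pick (scale, zero_point) for an unsigned b-bit grid covering the value range."""
    qmin, qmax = 0, 2 ** bits - 1
    lo = min(min(values), 0.0)   # stretch the range so 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # guard against an all-zero tensor
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def quantize(x, scale, zero_point, qmin, qmax):
    """q = round(x / s) + z, clamped to the representable integers."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """x' = s * (q - z)"""
    return scale * (q - zero_point)


weights = [-0.8, -0.1, 0.0, 0.05, 0.3, 1.2]
scale, zp, qmin, qmax = affine_params(weights)
codes = [quantize(w, scale, zp, qmin, qmax) for w in weights]
restored = [dequantize(q, scale, zp) for q in codes]
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"scale={scale:.5f} zero_point={zp} max round-trip error={max_err:.5f}")
```

The round-trip error for any in-range value is bounded by about half a step, which is why the choice of scale (and therefore the range it has to cover) dominates quality.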

Distillation is the process of training a smaller student model to imitate the output of a larger teacher model. The 2015 paper that named the technique, Hinton, Vinyals, and Dean’s “Distilling the Knowledge in a Neural Network”, made the core observation that became the entire field: the teacher’s full probability distribution over the output classes carries far more information than the hard label alone, because the relative ranking of the wrong answers encodes structure about the input that the student can learn from. A teacher that says “this is 80% cat, 19% dog, 1% airplane” tells the student something a teacher that just says “cat” doesn’t — namely that the input is more dog-like than airplane-like. Training the student to match the soft distribution rather than the hard label is the entire pitch, and it works.

Both techniques compress; the difference is what they compress. Quantization keeps the same number of weights but stores each one cheaper. Distillation throws away weights entirely and recovers behavior through a different network shape. Most production stacks use both.

Intuition

The right way to see quantization: floats are wasteful in the regime LLM weights live in. A 16-bit weight has roughly 65,000 distinct representable values; a 32-bit weight has roughly 4 billion. But a trained transformer’s weights are tightly clustered around zero, with a long tail of outliers, and the meaningful dynamic range — the precision the network actually exploits during inference — is much narrower than the representation suggests. INT8 gives you 256 distinct values, which sounds catastrophically lossy until you recognize that a Gaussian-distributed weight matrix uses about 8 bits of effective dynamic range to begin with. The mismatch is what makes quantization work: you can throw away most of the float’s representational capacity without throwing away any of the network’s actual behavior.

The right way to see distillation: a trained network’s output distribution is a much richer training signal than its labels. Imagine training a classifier on a dataset where someone has annotated not just the right answer but their entire confidence distribution over all the possible answers. The hard-label dataset says “this is a cat.” The distilled dataset says “this is 80% cat because of the ears, 19% dog because of the body shape, 1% airplane because of the background sky.” A student trained on the soft-label dataset learns the similarity geometry of the task, not just the right answer. That geometry is what lets a small student keep up with a large teacher on a workload — the teacher has implicitly transferred its understanding of what makes inputs similar.

The distributed-systems parallel

Quantization is lossy compression for a content-addressable store. Filesystems and storage layers have spent decades making the same trade: store the data in a cheaper representation, accept some loss on read, recover the value with a decoder. JPEG, MP3, H.264 — all of them re-encode high-fidelity media into a smaller representation that’s “close enough” for the perceptual budget of the consumer. The recurring trick: identify the parts of the signal that don’t carry information for the downstream task, and throw them away. Quantization does the same thing for neural-network weights — the high-precision bits don’t carry information for the downstream task (matrix multiplication followed by a non-linearity), so they’re free to drop. The SmoothQuant paper (Xiao et al., 2022) calls out the exact analogue: weights are easy to quantize because their distribution is well-behaved, but activations have outliers that are the LLM equivalent of high-frequency detail in an image — drop them naively and the model collapses; redistribute them across the model first (the “smoothing” step) and they become quantizable.

Distillation is sparse summarization for a database. Materialized views in a database hold a precomputed, smaller representation of a larger underlying table; they’re cheaper to query, slower to update, and they trade representational fidelity for query latency. The view doesn’t know everything the underlying table knows — it only knows the projection that the materialization captured — but for the workload that built it, it’s strictly cheaper. Distillation is the materialized view of the teacher model: pre-compute the teacher’s responses to a representative slice of the input distribution, train the student on that slice, deploy the student. The student is faster at query time, slower (impossible) to update without re-running distillation, and it knows only what the teacher’s outputs on the calibration distribution captured. The DeepSeek-R1 release in early 2025 is the canonical recent example: 800k reasoning trajectories generated from the R1 teacher, used to fine-tune Qwen and Llama students that inherited a chunk of R1’s reasoning quality without inheriting its parameter count.

Quantization mechanics: bit widths, schemes, formats

The mechanical primitives are worth being precise about. The vocabulary is dense but each piece pulls weight in production.

Bit width. The headline number. The 2026 production lineup is INT8, INT4, FP8, and increasingly FP4. INT8 retains essentially full quality on most workloads at 2× memory savings vs FP16; INT4 cuts memory 4× but starts to show degradation on reasoning-heavy tasks; FP8 matches INT8 on memory but gets native tensor-core support on Hopper and Blackwell GPUs; FP4 is the Blackwell-era frontier and the place where current research is concentrated.

Quantization scheme. Symmetric vs asymmetric: symmetric maps [-α, α] to [-2^(b-1), 2^(b-1) - 1] with z=0; asymmetric maps [min, max] to [0, 2^b - 1] with z non-zero. Weights are usually symmetric (their distributions are roughly centered); activations are usually asymmetric (they’re non-negative after ReLU and strongly skewed toward positive values after SiLU or GELU). Uniform vs non-uniform: uniform spaces the representable values evenly across the range; non-uniform (e.g. NF4, the NormalFloat 4-bit format from QLoRA) spaces them according to a target distribution. For normally-distributed weights, NF4 places its 16 values at quantiles of the normal so each representable value covers about the same density of actual weights. NF4 is the reason QLoRA works at 4 bits where naive INT4 doesn’t.
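To see why the symmetric/asymmetric choice follows the shape of the distribution, compare the step size each scheme ends up with. This is a rough sketch with made-up values; it assumes the restricted symmetric range [-127, 127] at 8 bits, which is one common variant:

```python
def symmetric_step(values, bits=8):
    """Step size when [-alpha, alpha] is mapped onto the signed grid with z = 0."""
    alpha = max(abs(v) for v in values)
    return alpha / (2 ** (bits - 1) - 1)


def asymmetric_step(values, bits=8):
    """Step size when [min, max] is mapped onto [0, 2^b - 1]."""
    return (max(values) - min(values)) / (2 ** bits - 1)


centered = [-1.0, -0.4, -0.1, 0.2, 0.5, 1.0]   # weight-like: roughly symmetric
one_sided = [0.0, 0.2, 0.9, 1.7, 3.0, 6.0]     # ReLU-like: non-negative

for name, vals in [("centered", centered), ("one-sided", one_sided)]:
    print(f"{name:10s} symmetric={symmetric_step(vals):.5f} "
          f"asymmetric={asymmetric_step(vals):.5f}")
```

On the centered values the two step sizes are nearly identical, so the simpler z=0 scheme costs nothing. On the one-sided values the symmetric grid spends half its codes on negatives that never occur, and the asymmetric step is about half as large.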

Granularity. Where the (s, z) parameters are computed. Per-tensor: one scale for the whole matrix. Fast, brutal, loses quality on anything but small models. Per-channel: one scale per output channel of a linear layer. The current standard for weight quantization. Per-group: one scale per group of N consecutive weights inside a channel (typically N=64 or N=128). The best quality-per-bit at INT4; the cost is slightly more metadata. Per-token: one scale per input token for activations. Necessary for activation quantization because activation statistics change per token.
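A small experiment makes the granularity trade concrete. The sketch below uses synthetic Gaussian weights with one injected outlier (both assumptions, chosen to exaggerate the effect) and compares 4-bit round-to-nearest with one shared scale against one scale per group of 128:

```python
import random


def quant_dequant(values, bits=4):
    """Symmetric round-to-nearest with a single shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in values) or 1.0) / qmax
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in values]


def per_group(values, group_size, bits=4):
    """Same quantizer, but each group of `group_size` weights gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quant_dequant(values[start:start + group_size], bits))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]
row[7] = 1.5   # a single outlier stretches any scale that has to cover it

err_tensor = mse(row, quant_dequant(row))
err_group = mse(row, per_group(row, 128))
print(f"one scale for the row : mse={err_tensor:.2e}")
print(f"one scale per 128     : mse={err_group:.2e}")
```

With a single scale the outlier forces a step so coarse that the ordinary weights all round to zero; with groups, only the group containing the outlier pays that price.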

Weight-only vs weight-and-activation. Weight-only (W4A16, W8A16): quantize only the weights, keep activations in FP16 or BF16. The cheap, safe, default. Used for AWQ and GPTQ. Weight-and-activation (W8A8, W4A8): quantize both. Higher throughput on supporting hardware, but requires solving the activation-outlier problem. SmoothQuant, ZeroQuant, and FP8 fall here.

Static vs dynamic. Static (calibration-based): run a small calibration dataset through the model once to determine (s, z) per channel, freeze them. Dynamic: compute (s, z) per inference at runtime from the actual activation values. Dynamic is more accurate for activations but adds runtime overhead; static is the production default once a good calibration set exists.
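The difference shows up as soon as runtime data leaves the calibrated range. A toy sketch with made-up activation values and a symmetric 8-bit quantizer; the helper names are illustrative:

```python
def absmax_scale(values, bits=8):
    """Symmetric scale that just covers the largest magnitude in `values`."""
    qmax = 2 ** (bits - 1) - 1
    return (max(abs(v) for v in values) or 1.0) / qmax


def fake_quant(values, scale, bits=8):
    """Quantize then dequantize, clamping to the signed integer range."""
    qmax = 2 ** (bits - 1) - 1
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in values]


# Static: one pass over calibration batches, then the scale is frozen.
calibration_batches = [[0.1, -0.4, 0.9], [0.3, 1.1, -0.7], [-1.0, 0.2, 0.5]]
static_scale = max(absmax_scale(batch) for batch in calibration_batches)

# At inference time a value arrives that calibration never saw.
runtime_batch = [0.2, -0.6, 3.0]
static_out = fake_quant(runtime_batch, static_scale)
dynamic_out = fake_quant(runtime_batch, absmax_scale(runtime_batch))  # recomputed per call

print("static :", [round(v, 3) for v in static_out])
print("dynamic:", [round(v, 3) for v in dynamic_out])
```

The static path clips the unseen value to the calibrated maximum, while the dynamic path tracks it at the cost of computing a reduction on every call.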

Quantization algorithms: GPTQ, AWQ, SmoothQuant, FP8

The interesting work isn’t in the bit width — it’s in the algorithm that picks the per-channel scales and decides how to compensate for the quantization error.

GPTQ (Frantar et al., 2022). The Hessian-based heavyweight. GPTQ quantizes a transformer one layer at a time, and within each layer one column at a time. At each step it picks the integer-quantized value of the current column, then updates the remaining un-quantized weights to compensate for the quantization error using second-order information from the layer’s input-Hessian. The trick is borrowed from the Optimal Brain Quantization line of work — instead of treating quantization as a single rounding step, treat it as a sequence of greedy decisions where each decision shifts the optimization landscape for the next one. GPTQ quantized the largest publicly available models of its day (OPT-175B, BLOOM-176B) to 4 bits in about four GPU hours with minimal perplexity increase. The trade-off: it’s slow to run vs naive quantization, but the quality at 4 bits was the breakthrough that made 4-bit a serious production option.

AWQ (Lin et al., 2023). The activation-aware lightweight. AWQ’s central observation: not all weights are equally important to quantize well. About 1% of weights — the ones connected to the highest-magnitude activations — carry most of the information; the other 99% can be quantized aggressively. AWQ finds the salient weights by looking at the activation magnitudes (not the weight magnitudes — that’s the “activation-aware” name) and applies a per-channel scaling that amplifies the salient weights before quantization, then divides the activations by the same factor at runtime to cancel out the amplification. The math is exact — it’s a similarity transform — but the salient weights now occupy a wider slice of the quantization range, which preserves their precision. AWQ skips GPTQ’s slow Hessian computation and runs much faster, and the quality at INT4 is consistently equal to or better than GPTQ. AWQ is the production INT4 default in 2026; most pre-quantized 4-bit checkpoints on Hugging Face are AWQ.
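The scaling idea can be shown on a single output channel. This is a toy sketch, not the AWQ implementation: the weights, activations, and the scale of 2.0 are invented for illustration, whereas AWQ searches for the scale from calibration activations.

```python
def quantize_row(row, bits=4):
    """Symmetric round-to-nearest using the row's own max as the range."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in row) / qmax
    return [step * round(w / step) for w in row]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


row = [0.9, 0.17, -0.5, 0.3]   # one output channel's weights
x = [0.1, 8.0, 0.2, -0.1]      # input channel 1 carries a large activation
s = [1.0, 2.0, 1.0, 1.0]       # hypothetical per-input-channel scale

y_ref = dot(row, x)
y_naive = dot(quantize_row(row), x)

# Scale the salient weight up and its activation down: exact before rounding.
scaled_row = [w * f for w, f in zip(row, s)]
scaled_x = [v / f for v, f in zip(x, s)]
y_equiv = dot(scaled_row, scaled_x)
y_awq = dot(quantize_row(scaled_row), scaled_x)

print(f"reference      {y_ref:.4f}")
print(f"naive INT4     {y_naive:.4f}  (error {abs(y_naive - y_ref):.4f})")
print(f"scaled + INT4  {y_awq:.4f}  (error {abs(y_awq - y_ref):.4f})")
```

Because the scaled weight does not change the row's maximum, the quantization step stays the same while the rounding error on the salient weight is divided by the scale when the activation is divided back.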

SmoothQuant (Xiao et al., 2022). The W8A8 unlocker. The activation-outlier problem hits hard above 6.7B parameters: a small fraction of activation channels (often <1%) carry magnitudes 100× larger than the rest, and naive activation quantization either truncates those channels (catastrophic quality loss) or sets the scale to cover them (catastrophic precision loss on the other 99%). SmoothQuant’s fix: migrate the quantization difficulty from activations to weights by an offline per-channel rescaling. Activations get smoother (easier to quantize), weights get spikier (still quantizable because they have headroom), and because the factor multiplied into the weights is the inverse of the one divided out of the activations, the model’s outputs are unchanged. The result is W8A8 quantization with quality at parity to W8A16. On hardware with INT8 tensor cores the peak matmul rate is twice that of FP16; the paper itself reports up to 1.56× end-to-end speedup.
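A sketch of the smoothing step on per-channel magnitude statistics, using the per-channel factor from the SmoothQuant paper, s_j = max|X_j|^α / max|W_j|^(1-α). The numbers are invented and α = 0.5 is simply the paper's usual starting point, not a tuned value:

```python
def smooth_factors(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, weight_absmax)]


act_absmax = [0.5, 60.0, 0.8, 0.4]       # channel 1 is the activation outlier
weight_absmax = [0.3, 0.2, 0.4, 0.25]    # weights are comparatively flat

s = smooth_factors(act_absmax, weight_absmax)
smoothed_act = [a / f for a, f in zip(act_absmax, s)]      # X' = X / s
smoothed_w = [w * f for w, f in zip(weight_absmax, s)]     # W' = s * W


def spread(values):
    """Ratio between the largest and smallest channel magnitude."""
    return max(values) / min(values)


print(f"activation spread before: {spread(act_absmax):7.1f}x")
print(f"activation spread after : {spread(smoothed_act):7.1f}x")
print(f"weight spread after     : {spread(smoothed_w):7.1f}x")
```

At α = 0.5 each channel's activation and weight magnitudes meet at their geometric mean, so the outlier is shared between the two tensors instead of sitting entirely in the activations.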

FP8. The hardware-native 8-bit float. Hopper and Blackwell GPUs include native FP8 tensor cores that compute matrix multiplications at 2× the rate of FP16. FP8 comes in two flavors: E4M3 (4 exponent bits, 3 mantissa bits: more precision, less range) and E5M2 (5 exponent bits, 2 mantissa bits: more range, less precision). Inference typically uses E4M3 for both weights and activations; E5M2’s extra range matters mostly for gradients during FP8 training. The pitch: native hardware support means no special kernels, near-FP16 quality out of the box, and dramatic memory and throughput gains. TensorRT-LLM roughly doubles throughput on H100 with FP8 at minimal accuracy cost, and on Blackwell the FP8 advantage compounds with the new FP4 path.
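The range/precision trade between the two flavors falls straight out of the bit layout. A small sketch that derives the limits, assuming the widely used encodings: E5M2 follows IEEE conventions (the all-ones exponent is reserved for inf/NaN), while E4M3 reserves only the single all-ones exponent-and-mantissa pattern for NaN:

```python
def fp8_limits(exp_bits, man_bits, top_exponent_reserved):
    """Largest finite value, smallest normal value, and relative step of an FP8 layout."""
    bias = 2 ** (exp_bits - 1) - 1
    if top_exponent_reserved:
        # IEEE-style: the all-ones exponent encodes inf/NaN, so it is unusable.
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2 ** -man_bits
    else:
        # Only exponent=all-ones with mantissa=all-ones is NaN; one mantissa code is lost.
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2 ** (1 - man_bits)
    return {
        "max": max_mantissa * 2.0 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "relative_step": 2.0 ** -man_bits,
    }


e4m3 = fp8_limits(4, 3, top_exponent_reserved=False)
e5m2 = fp8_limits(5, 2, top_exponent_reserved=True)
print("E4M3:", e4m3)
print("E5M2:", e5m2)
```

E4M3 tops out at 448 with a relative step of 1/8; E5M2 reaches 57344 but with a relative step of 1/4.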

A summary table of the 2026 lineup:

| Format | Bits | Weight/Act | Calibration | Quality vs FP16 | Hardware |
|---|---|---|---|---|---|
| FP16/BF16 | 16 | W16A16 | n/a | baseline | universal |
| FP8 (E4M3/E5M2) | 8 | W8A8 | per-tensor or per-channel | within ~0.1 pp | Hopper, Blackwell |
| INT8 (SmoothQuant) | 8 | W8A8 | per-channel + smoothing | within ~0.2 pp | broad |
| INT8 (weight-only) | 8 | W8A16 | per-channel | within ~0.1 pp | universal |
| INT4 (AWQ) | 4 | W4A16 | activation-aware per-group | within ~0.5–1 pp | broad |
| INT4 (GPTQ) | 4 | W4A16 | Hessian-based per-group | within ~0.5–1 pp | broad |
| NF4 (QLoRA-style) | 4 | W4A16 | none (datatype is the trick) | within ~0.5–1 pp | broad |
| FP4 (Blackwell) | 4 | W4A4 | per-tensor or per-channel | within ~1 pp on early benchmarks | Blackwell only |

The honest 2026 default: FP8 if you’re on Hopper or Blackwell and need full quality; AWQ INT4 if memory or non-data-center deployment is the constraint; SmoothQuant W8A8 if INT8 throughput on broader hardware is the priority. GPTQ is still around but AWQ has largely replaced it as the production default.
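That default can be written down as a tiny decision helper. This is only an encoding of the rule of thumb in the paragraph above; the function name, argument names, and the ordering of the checks are illustrative choices, not a standard API:

```python
def pick_format(gpu_family: str, memory_bound: bool) -> str:
    """Encode the rule of thumb: INT4 for memory, FP8 on Hopper/Blackwell, else INT8 W8A8."""
    native_fp8 = gpu_family.lower() in {"hopper", "blackwell"}
    if memory_bound:
        # Memory or non-data-center deployment is the binding constraint.
        return "INT4 (AWQ, W4A16)"
    if native_fp8:
        # Full quality with hardware-native 8-bit matmuls.
        return "FP8 (W8A8)"
    # INT8 throughput on broader hardware.
    return "INT8 (SmoothQuant, W8A8)"


for family, tight in [("Hopper", False), ("Ampere", False), ("Ada", True)]:
    print(f"{family:8s} memory_bound={tight!s:5s} -> {pick_format(family, tight)}")
```

Whatever the helper picks is a starting point; the eval on your own workload decides whether it stays.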

Distillation mechanics: soft targets, temperature, dataset construction

The soft-target loss. The original distillation loss from Hinton et al. is a weighted combination of two terms: the standard hard-label cross-entropy and the soft-target KL divergence from the teacher distribution to the student distribution. The trick that makes the soft distribution useful is temperature scaling: divide the logits by T > 1 before the softmax. Higher T flattens the distribution, exposing the smaller probabilities the teacher assigns to the wrong answers. At T=1 the teacher’s confident answer drowns out everything else; at T=4 or T=10, the relative ranking of the wrong answers becomes visible and trainable. The loss is α · CE(student_T1, hard_label) + (1-α) · T² · KL(teacher_T || student_T), where the subscript is the softmax temperature, with T typically in [2, 10] and α typically around 0.5. The T² factor exists because the gradients of the softened KL are smaller by 1/T², and rescaling restores them to the same magnitude as the hard-label gradient.
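A minimal single-example sketch of that loss in plain Python, with the teacher distribution as the target of the KL term as in Hinton et al. The logits are made up, and a real training loop would compute this batched in a tensor library:

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max logit for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """alpha * CE(student at T=1, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    hard = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return alpha * hard + (1 - alpha) * T * T * soft


teacher_logits = [6.0, 4.0, 0.0]   # cat, dog, airplane
student_logits = [3.0, 0.5, 0.4]   # right answer, but no sense that dog > airplane

print("teacher at T=1:", [round(p, 3) for p in softmax(teacher_logits, 1.0)])
print("teacher at T=4:", [round(p, 3) for p in softmax(teacher_logits, 4.0)])
print("loss:", round(distillation_loss(student_logits, teacher_logits, label=0), 4))
```

The two printed teacher distributions show the effect of temperature: at T=4 the "dog" and "airplane" probabilities are large enough to contribute a usable gradient.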

Distillation regimes. Logit-level distillation (the Hinton original): the student learns to match the teacher’s full output distribution. Needs access to teacher logits — easy with open-weight teachers, impossible with closed-API teachers that only return text. Sequence-level distillation (the workhorse): generate text from the teacher, fine-tune the student on the (input, generated-output) pairs as standard SFT. Doesn’t need logits — works with any teacher you can query — but throws away most of the per-token distribution information. On-policy distillation (MiniLLM, Gu et al., 2023): student generates a response, teacher scores it, student updates via policy gradient with reverse-KL objective. Better calibration on long generations than forward-KL distillation, but needs teacher logits and is more expensive than offline SFT. The 2026 production default is sequence-level — closed-API teachers are too useful to give up — with on-policy used for capability transfer when access to teacher logits is available.
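The forward-KL versus reverse-KL distinction is easier to feel with numbers. In this toy sketch (all distributions invented), the teacher splits its mass between two good tokens; one candidate student covers both but leaks probability onto a bad token, the other commits to a single good token:

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [0.495, 0.495, 0.01]     # two good tokens, one bad
covering = [0.35, 0.35, 0.30]      # covers both modes, leaks onto the bad token
committed = [0.98, 0.01, 0.01]     # picks one mode and sticks to it

for name, student in [("covering", covering), ("committed", committed)]:
    forward = kl(teacher, student)   # offline logit distillation minimizes this
    reverse = kl(student, teacher)   # the objective MiniLLM-style methods use
    print(f"{name:10s} forward={forward:.3f} reverse={reverse:.3f}")
```

Forward KL prefers the covering student because it punishes missing a teacher mode; reverse KL prefers the committed student because it punishes putting mass where the teacher puts almost none. For a small student that cannot represent everything the teacher does, that second behavior is often the one you want during generation.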

Dataset construction. The single most important variable. The distillation corpus needs to cover the student’s deployment distribution; if the corpus is off-distribution, the student will be off-distribution. The standard recipe: capture a representative sample of real user traffic, run it through the teacher, use the (input, teacher-output) pairs as training data. For reasoning tasks, capture the full chain-of-thought from the teacher — that’s what DeepSeek-R1 did with its 800k reasoning trajectories, and the result was small students with non-trivial chunks of the teacher’s reasoning quality. The corpus size matters less than the corpus distribution: 50k well-curated examples typically beat 500k randomly sampled ones.

Production examples. DistilBERT (Sanh et al., 2019) is the classic: 40% smaller than BERT, 60% faster, and retaining 97% of BERT’s language-understanding performance on GLUE. The recipe (smaller architecture, distillation from the full BERT teacher) became the template the field has been iterating on since. The 2024-2026 evolution: distill smaller students from much larger teachers, on much larger corpora, with much more careful corpus construction. The DeepSeek-R1-Distill family (1.5B through 70B students distilled from the R1 teacher) is the current canonical demonstration that distillation can transfer reasoning capabilities, not just pattern-matching, when the corpus is rich enough. OpenAI’s GPT-4o mini is widely understood to be a distillation of GPT-4o, though the recipe is undisclosed.

Code: post-training quantization in Python with llm-compressor

The 2026 idiomatic way to quantize a Hugging Face model for vLLM serving is llm-compressor, the official toolkit maintained by the vLLM team. It supports INT4 (GPTQ-style), INT8 (weight-only and W8A8), and FP8 with a consistent API. Install (quoted so the shell does not treat >= as a redirect): pip install "llmcompressor>=0.4" "transformers>=4.46" "datasets>=3"

python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier, QuantizationModifier

MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
OUT_INT4 = "./Llama-3.1-8B-INT4"
OUT_FP8 = "./Llama-3.1-8B-FP8"

# Load the base model and tokenizer in fp16 (compression converts in place).
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Calibration set: a small slice of representative inputs. 512 samples is the
# usual sweet spot — enough to determine per-channel statistics, small enough
# to run on a single GPU in minutes.
calib = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
calib = calib.map(lambda row: {
    "text": tokenizer.apply_chat_template(row["messages"], tokenize=False)
})
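# oneshot() takes a datasets.Dataset (or a registered dataset name) rather than
# a plain list of strings, so tokenize here and keep the Dataset object, as the
# llm-compressor examples do.
calib_texts = calib.map(
    lambda row: tokenizer(row["text"], max_length=2048, truncation=True,
                          add_special_tokens=False),
    remove_columns=calib.column_names,
)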

# --------------------------------------------------------------------------
# INT4 weight-only with GPTQ-style Hessian-based quantization.
# This is the workhorse for memory-constrained serving — Llama-3.1-8B in INT4
# fits comfortably on a 16GB GPU vs 18GB+ for fp16.
# --------------------------------------------------------------------------
recipe_int4 = GPTQModifier(
    targets="Linear",
    scheme="W4A16",            # 4-bit weights, 16-bit activations
    ignore=["lm_head"],         # skip the output projection
    dampening_frac=0.1,         # Hessian regularization
)
oneshot(
    model=model,
    dataset=calib_texts,
    recipe=recipe_int4,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=OUT_INT4,
)

# --------------------------------------------------------------------------
# FP8 weight-and-activation quantization for Hopper/Blackwell deployment.
# Native hardware support — the FP8 tensor cores on H100/B200 double matmul
# throughput vs FP16 with essentially no quality loss.
# --------------------------------------------------------------------------
# Reload fresh model since the previous call mutated weights in-place.
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

recipe_fp8 = QuantizationModifier(
    targets="Linear",
    scheme="FP8",               # 8-bit float (E4M3) weights and activations
    ignore=["lm_head"],
)
oneshot(
    model=model,
    dataset=calib_texts,
    recipe=recipe_fp8,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=OUT_FP8,
)

# Resulting checkpoints can be served directly by vLLM:
#   vllm serve ./Llama-3.1-8B-INT4 --quantization compressed-tensors
#   vllm serve ./Llama-3.1-8B-FP8  --quantization compressed-tensors

Three pieces worth pulling out.

First, the calibration set. 512 samples is plenty for most workloads, but the distribution matters more than the count. If your production traffic is code completion, calibrate on code. If it’s chat, calibrate on chat. Calibration on out-of-distribution text will leave you with a quantized model that’s accurate on the calibration data but degraded on real inputs. The standard mistake is to calibrate on whatever Hugging Face dataset comes to mind first; the standard fix is to capture a small slice of real production prompts and use that.

Second, the ignore=["lm_head"] is the universal exception. The final output projection is sensitive to quantization in a way the inner layers aren’t, because it directly produces the token-probability distribution and any precision loss there compounds across the entire vocabulary. Skipping it is essentially free (one layer of FP16 weights out of dozens) and meaningfully reduces quality loss.

Third, the dual-path setup. INT4 for memory-constrained serving (consumer GPUs, edge deployment, dense multi-tenant); FP8 for throughput-maximizing serving on Hopper or Blackwell. Different bit widths for different constraint regimes; the same calibration set works for both.

Code: a distillation pipeline in TypeScript

The other half of the compression stack is distilling a frontier teacher into a deployable student. The minimum viable pipeline is: query the teacher on a representative slice of your traffic, capture (input, output) pairs as JSONL, fine-tune a smaller student on that JSONL. The Vercel AI SDK is the production-idiomatic TypeScript surface for the teacher-query side, and the resulting JSONL feeds directly into any managed fine-tuning API.

typescript
// npm install ai @ai-sdk/anthropic together-ai
import { anthropic } from "@ai-sdk/anthropic";
import { generateText } from "ai";
import Together from "together-ai";
import * as fs from "node:fs";

const together = new Together({ apiKey: process.env.TOGETHER_API_KEY });

// Representative slice of production inputs. In practice these come from
// captured traffic — replay-from-prod, not synthetic data. The student
// will only be as good as the input distribution this corpus represents.
const inputs: { input: string }[] = JSON.parse(
  fs.readFileSync("./production-inputs.json", "utf-8")
);

// --------------------------------------------------------------------------
// Phase 1: teacher generation. Run each input through the frontier teacher,
// capture the response. For reasoning tasks, request the full chain-of-thought
// so the student inherits the reasoning structure, not just the answer.
// --------------------------------------------------------------------------
const TEACHER = anthropic("claude-opus-4-7");
const distillSet: { messages: { role: string; content: string }[] }[] = [];

for (const row of inputs) {
  const { text } = await generateText({
    model: TEACHER,
    messages: [
      { role: "system", content: "Think step by step, then give the answer." },
      { role: "user", content: row.input },
    ],
    temperature: 0.0,  // deterministic teacher outputs make eval reproducible
  });

  distillSet.push({
    messages: [
      { role: "user", content: row.input },
      { role: "assistant", content: text },
    ],
  });
}

// Persist as JSONL — the universal fine-tuning input format.
const jsonl = distillSet.map((row) => JSON.stringify(row)).join("\n");
fs.writeFileSync("./distill.jsonl", jsonl);

// --------------------------------------------------------------------------
// Phase 2: student fine-tune. Submit the (input, teacher-output) JSONL to a
// managed fine-tuning API targeting a smaller open-weight model. The result
// is a deployable adapter that inherits a chunk of the teacher's behavior
// at a fraction of the inference cost.
// --------------------------------------------------------------------------
const file = await together.files.upload(
  fs.createReadStream("./distill.jsonl"),
  { purpose: "fine-tune" }
);

const job = await together.fineTuning.create({
  training_file: file.id,
  model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
  n_epochs: 3,
  learning_rate: 1e-4,
  lora: true,           // PEFT for cost, per the LoRA article's argument
  lora_r: 32,
  lora_alpha: 64,
  lora_trainable_modules: "all-linear",
});

console.log("distillation fine-tune submitted:", job.id);
// When job.status === "completed", job.output_name is a deployable model ID
// that costs roughly 1/10th the teacher per call.

The shape of this pipeline is the same shape every production team converges on. The variables are which teacher (Opus, GPT-5.5, Gemini Pro), which student (Haiku-class, Llama 8B/70B, Qwen 7B/32B), and which fine-tuning substrate (managed API, in-house TRL+PEFT, vendor-specific). The mechanics — query teacher, capture pairs, fine-tune student — don’t change.

Two things to call out. First, temperature: 0.0 on the teacher is deliberate. Near-deterministic outputs make the distillation corpus reproducible and the student’s eval results comparable across runs, whereas sampling noise in the teacher is noise the student has to learn to ignore, which costs capacity. Second, the LoRA configuration matches yesterday’s PEFT defaults because the economics that argued for PEFT during behavioral fine-tuning argue equally hard during distillation fine-tuning. The cost case for full-parameter distillation collapses the same way the cost case for full-parameter SFT collapsed; PEFT is the substrate either way.

Trade-offs, failure modes, gotchas

The INT4 quality cliff. INT4 is not free below a certain model size or above a certain task difficulty. The pattern: a 70B model in INT4 loses 0.5–1 percentage point on most benchmarks; a 7B model in INT4 loses 2–4 points on reasoning-heavy tasks. The smaller the model, the less redundancy there is in its weights, and the more each quantization-induced error costs in downstream quality. The diagnostic: run your eval suite against the quantized model on your actual workload before committing. The fix when INT4 lands on the wrong side of the cliff: step up to W8A16 (near-lossless quality, but only half the memory savings of INT4) or to FP8 if on supporting hardware.

The activation-outlier problem. Above ~7B parameters, naïve activation quantization fails catastrophically because a handful of activation channels carry magnitudes 100×+ larger than the rest. SmoothQuant solves it for INT8; FP8’s wider dynamic range makes it less acute; INT4 activation quantization remains an active research problem and is not yet a production-ready format. The symptom: the model quantizes cleanly, eval scores drop 10+ percentage points, and the cause is unobvious until you look at the activation histograms. The fix: use weight-only quantization (W4A16, W8A16) unless you’ve explicitly applied SmoothQuant or FP8.

Calibration-set distribution mismatch. The most common quantization failure in production. A quantization run calibrated on WikiText or C4 will produce a model that’s accurate on those distributions and degraded on yours. The symptom: eval scores look fine on standard benchmarks but quality complaints come in from real users. The fix: calibrate on a representative slice of production traffic, treat the calibration set as a first-class artifact with its own versioning, and re-calibrate when the workload distribution shifts. The 500-sample calibration set is often the single most underinvested component of a quantization pipeline.

Distillation’s capability ceiling. Distillation transfers what the teacher demonstrates on the calibration distribution. It does not transfer capabilities the calibration corpus doesn’t exercise. A distilled student trained on math problems will not be able to write code; a distilled student trained on English will not handle French. The student’s capability set is the intersection of (its base model’s pre-trained capabilities) and (the slice of the teacher’s behavior the corpus captured). The symptom: the student looks great on the eval set, fails on a related task the eval set didn’t cover. The fix: diversify the distillation corpus toward the full deployment distribution, not just the most common task.

Sequence-level distillation’s calibration drift. Distilling from a closed-API teacher via sequence-level distillation throws away the teacher’s full distribution and keeps only the one token the teacher actually emitted at each position. The result: the student’s calibration drifts. The student becomes overconfident on the teacher’s preferred answers and underconfident on the alternatives the teacher’s distribution would have considered. This shows up as worse performance under LLM-as-judge evaluation with pairwise comparisons, and worse downstream behavior when the student is used in agentic loops that rely on uncertainty signals. The mitigation: on-policy distillation (MiniLLM-style) if you have access to teacher logits; sampling-based distillation (capture multiple teacher responses per input) if you don’t.

Quantization-aware training vs post-training quantization. All the techniques above are post-training (PTQ) — quantize after the model is already trained. The alternative is quantization-aware training (QAT), where the model is fine-tuned with simulated quantization in the forward pass so the optimizer adapts the weights to be more quantization-friendly. QAT recovers an additional 0.3–0.7 percentage points of quality at INT4 vs PTQ, at the cost of running a full fine-tuning pass. For most production stacks PTQ is the right starting point; QAT is the escalation when PTQ has hit its quality ceiling and the workload still needs more headroom. PTQ-first is the right default because the cost-to-implement-and-test ratio is overwhelmingly in PTQ’s favor — quantize once with llm-compressor, eval, decide if QAT is justified.

The “everything FP8” trap. FP8 is so good on Hopper and Blackwell that the temptation is to use it everywhere. But FP8 requires hardware support: on GPUs without FP8 tensor cores (A100, V100, consumer cards before the RTX 40-series), FP8 falls back to software emulation that can be slower than FP16. The deployment-vs-format match matters: FP8 if your fleet is H100/H200/B200; AWQ INT4 if your fleet is older or mixed; W8A8 SmoothQuant if your fleet is mixed and you want a portable middle ground.

Distillation as a knowledge-injection vehicle. Distillation is sometimes pitched as a way to inject knowledge into a smaller model. It isn’t — at least not reliably. The student inherits patterns and behaviors from the teacher’s outputs, but factual recall remains tied to the student’s pre-trained base, not the teacher’s. The knowledge-injection wall that constrains SFT also constrains distillation. If the goal is to give a small model access to facts the teacher knows, the right answer is RAG, not distillation. Use distillation to transfer capabilities (reasoning, format, style); use RAG to transfer facts.

Further reading from the field

  • LoRA and Parameter-Efficient Fine-Tuning — the training-side counterpart. QLoRA’s NF4 quantization is the inference-quantization toolkit applied to training memory; the techniques are the same family.
  • Fine-Tuning vs RAG: When to Choose Which — the decision tree that closes with “distill from a frontier teacher” as the production escalation. Distillation is the fourth step in that sequence; this article covers the mechanics.
  • Inference Latency: Prefill, Decode, and Batching — the upstream system the quantized model gets deployed into. Quantization is one of the largest single throughput wins on top of continuous batching; the gains stack.
  • Cost Optimization and Model Routing — the application-side cost lever. Distillation produces the small models that routing tiers depend on; together they’re the dominant cost-reduction techniques in the 2026 production stack.