$ cat ai-engineering/quantization-distillation.md

Quantization and Distillation: Compression for Inference

Quantization and distillation methods for reducing inference memory, latency, and serving cost.

Jatin Bansal@blog:~/ai-engineering$ open quantization-distillation

A 70B-parameter model uses about 140 GB for 16-bit weights before activation memory. INT4 weights use about 35 GB. Quantization reduces numeric precision, while distillation trains a smaller student to reproduce a larger teacher’s behavior. Both trade some quality for lower serving cost, so deployment requires workload-specific evaluation.

Quantization and distillation

Quantization is the process of representing a tensor with fewer bits per element. A 16-bit weight occupies 2 bytes; an INT8 weight occupies 1 byte; an INT4 weight occupies half a byte. The mapping from the original floating-point value to the quantized integer is a function q = round(x / s) + z, where s is the scale (the floating-point span each integer step covers) and z is the zero point (the integer that represents zero in the original space). Dequantization at inference time recovers an approximation: x' = s · (q - z). The two free parameters, s and z, are chosen per tensor (or per channel, or per group of channels) to minimize the round-trip error, and the choice of granularity is the first major axis of the design space.
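A minimal pure-Python sketch of the round trip, assuming the common convention in which the zero point is an integer added after scaling (q = round(x / s) + z, x' = s · (q - z)) and an unsigned 8-bit range. The helper names are illustrative and do not come from any library.

```python
def affine_params(values, bits=8):
    """Pick scale s and integer zero point z so [min, max] maps onto [0, 2^bits - 1]."""
    qmax = 2**bits - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # keep 0.0 representable
    s = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero tensor
    z = round(-lo / s)
    return s, z


def quantize(values, s, z, bits=8):
    qmax = 2**bits - 1
    return [min(qmax, max(0, round(x / s) + z)) for x in values]


def dequantize(q, s, z):
    return [s * (qi - z) for qi in q]


weights = [-0.62, -0.11, 0.0, 0.27, 0.93]
s, z = affine_params(weights)
q = quantize(weights, s, z)
restored = dequantize(q, s, z)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("scale:", s, "zero point:", z)
print("codes:", q)
print("max round-trip error:", max_err)
```

The round-trip error stays within about one quantization step, and 0.0 survives exactly because the zero point is an integer code.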

Distillation is the process of training a smaller student model to imitate the output of a larger teacher model. The 2015 paper that named the technique, Hinton, Vinyals, and Dean’s “Distilling the Knowledge in a Neural Network”, made the core observation that the field still builds on: the teacher’s full probability distribution over the output classes carries far more information than the hard label alone, because the relative ranking of the wrong answers encodes structure about the input that the student can learn from. A teacher that says “this is 80% cat, 19% dog, 1% airplane” tells the student something a teacher that just says “cat” doesn’t: namely, that the input is more dog-like than airplane-like. Training the student to match the soft distribution rather than the hard label is the entire pitch, and it works.

Both techniques compress; the difference is what they compress. Quantization keeps the same number of weights but stores each one cheaper. Distillation throws away weights entirely and recovers behavior through a different network shape. Most production stacks use both.

Quantization mechanics

The mechanical primitives are worth being precise about. The vocabulary is dense but each piece pulls weight in production.

Bit width. The headline number. The 2026 production lineup is INT8, INT4, FP8, and increasingly FP4. INT8 retains essentially full quality on most workloads at 2× memory savings vs FP16; INT4 cuts memory 4× but starts to show degradation on reasoning-heavy tasks; FP8 matches INT8 on memory and adds native hardware support on Hopper and Blackwell GPUs; FP4 is the Blackwell-era frontier and the place where current research is concentrated.
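The memory side of the bit-width trade-off is simple arithmetic. The sketch below counts weight storage only, in decimal GB, and deliberately ignores scale and zero-point metadata, the KV cache, and activations; the function name is illustrative.

```python
BYTES_PER_WEIGHT = {"FP16/BF16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5, "FP4": 0.5}


def weight_memory_gb(n_params, fmt):
    """Weight storage only, in decimal GB (1e9 bytes)."""
    return n_params * BYTES_PER_WEIGHT[fmt] / 1e9


for fmt in BYTES_PER_WEIGHT:
    print(f"{fmt:10s} {weight_memory_gb(70e9, fmt):6.1f} GB")
```

For a 70B-parameter model this gives 140 GB at 16 bits, 70 GB at 8 bits, and 35 GB at 4 bits.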

Quantization scheme. Symmetric vs asymmetric: symmetric maps [-α, α] to [-2^(b-1), 2^(b-1) - 1] with z=0; asymmetric maps [min, max] to [0, 2^b - 1] with z non-zero. Weights are usually symmetric (their distributions are roughly centered on zero); activations are often asymmetric (they are non-negative after ReLU and heavily skewed after SiLU or GELU). Uniform vs non-uniform: uniform spaces the representable values evenly across the range; non-uniform (e.g. NF4, the NormalFloat 4-bit format from QLoRA) spaces them according to a target distribution. For normally distributed weights, NF4 places its 16 values at quantiles of the normal distribution, so each representable value covers about the same share of the actual weights. NF4 is a large part of why QLoRA holds up at 4 bits where naive uniform INT4 degrades.
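To make the uniform vs quantile-spaced distinction concrete, here is a sketch that builds 16 levels at evenly spaced quantiles of a standard normal and compares them with 16 uniform levels on [-1, 1]. This is in the spirit of NF4 only: the actual NF4 table is constructed differently (for example, it includes an exact zero), so treat the values as illustrative.

```python
from statistics import NormalDist


def quantile_levels(bits=4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled into [-1, 1]."""
    n = 2**bits
    # Midpoint quantiles avoid the infinite 0% and 100% quantiles.
    raw = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]


def uniform_levels(bits=4):
    n = 2**bits
    return [-1 + 2 * i / (n - 1) for i in range(n)]


nf = quantile_levels()
un = uniform_levels()
near_zero_nf = sum(abs(v) < 0.5 for v in nf)
near_zero_un = sum(abs(v) < 0.5 for v in un)
print([round(v, 3) for v in nf])
print(near_zero_nf, "quantile levels inside (-0.5, 0.5)")
print(near_zero_un, "uniform levels inside (-0.5, 0.5)")
```

The quantile-spaced grid puts more of its levels near zero, where most normally distributed weights sit, and fewer in the tails.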

Granularity. Where the (s, z) parameters are computed. Per-tensor: one scale for the whole matrix. Fast, brutal, loses quality on anything but small models. Per-channel: one scale per output channel of a linear layer. The current standard for weight quantization. Per-group: one scale per group of N consecutive weights inside a channel (typically N=64 or N=128). The best quality-per-bit at INT4; the cost is slightly more metadata. Per-token: one scale per input token for activations. Necessary for activation quantization because activation statistics change per token.
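A small sketch of why finer granularity helps at INT4, assuming symmetric absmax round-to-nearest and a toy weight row with one outlier. The row values and the group size of 4 are made up for illustration; real per-group schemes typically use 64 or 128.

```python
def absmax_quant_error(values, bits=4):
    """Symmetric absmax quantization of one block; returns summed squared error."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    restored = [scale * max(-qmax, min(qmax, round(v / scale))) for v in values]
    return sum((v - r) ** 2 for v, r in zip(values, restored))


def grouped_error(values, group_size, bits=4):
    """One scale per group of `group_size` consecutive weights."""
    return sum(
        absmax_quant_error(values[i : i + group_size], bits)
        for i in range(0, len(values), group_size)
    )


# One output channel with a single outlier in the last group.
row = [0.02, -0.05, 0.04, 0.01, -0.03, 0.06, -0.02, 0.05, 0.03, -0.04, 0.01, 3.5]
per_channel = absmax_quant_error(row)         # one scale for the whole row
per_group = grouped_error(row, group_size=4)  # three scales
print(f"per-channel SSE={per_channel:.5f}  per-group SSE={per_group:.5f}")
```

With a single scale, the outlier stretches the step size so far that every small weight rounds to zero; with per-group scales, only the outlier's own group pays that price.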

Weight-only vs weight-and-activation. Weight-only (W4A16, W8A16): quantize only the weights, keep activations in FP16 or BF16. The cheap, safe default. Used for AWQ and GPTQ. Weight-and-activation (W8A8, W4A8): quantize both. Higher throughput on supporting hardware, but requires solving the activation-outlier problem. SmoothQuant, ZeroQuant, and FP8 fall here.

Static vs dynamic. Static (calibration-based): run a small calibration dataset through the model once to determine (s, z) per channel, freeze them. Dynamic: compute (s, z) per inference at runtime from the actual activation values. Dynamic is more accurate for activations but adds runtime overhead; static is the production default once a good calibration set exists.

Quantization algorithms

The interesting work isn’t in the bit width; it’s in the algorithm that picks the per-channel scales and decides how to compensate for the quantization error.

GPTQ (Frantar et al., 2022) quantizes one transformer layer and one column at a time. After choosing the integer value for a column, it updates the remaining weights to compensate for the quantization error using second-order information from the layer’s input Hessian. This follows the Optimal Brain Quantization approach: each greedy decision changes the remaining choices. GPTQ quantized OPT-175B and BLOOM-176B to 4 bits in about four GPU hours with little perplexity increase. It is slower than naive quantization, but retains more quality at 4 bits.
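The error-compensation idea can be shown on a two-weight toy problem. This is a sketch of the underlying Optimal Brain Quantization update only, assuming an integer grid with scale 1 and a hand-written Hessian; real GPTQ processes whole layers, works from a Cholesky factor of the inverse Hessian, batches columns, and adds dampening. All names here are illustrative.

```python
def invert_spd(m):
    """Gauss-Jordan inverse of a small symmetric positive-definite matrix."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for c in range(n):
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def layer_loss(w, q, hessian):
    """(w - q)^T H (w - q): squared output error for inputs with H = X^T X."""
    e = [a - b for a, b in zip(w, q)]
    n = len(e)
    return sum(e[i] * hessian[i][j] * e[j] for i in range(n) for j in range(n))


def gptq_row(w, hessian):
    """Round weights left to right, pushing each rounding error onto later weights."""
    w, n = list(w), len(w)
    hinv = invert_spd(hessian)
    q = []
    for i in range(n):
        q.append(float(round(w[i])))  # integer grid with scale 1
        err = (w[i] - q[i]) / hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * hinv[i][j]  # compensate not-yet-quantized weights
        # Drop coordinate i from the inverse Hessian before the next step.
        d = hinv[i][i]
        hinv = [[hinv[r][c] - hinv[r][i] * hinv[i][c] / d for c in range(n)]
                for r in range(n)]
    return q


H = [[1.0, 0.9], [0.9, 1.0]]  # two strongly correlated input channels
w = [0.4, 0.4]
rtn = [float(round(v)) for v in w]
gptq = gptq_row(w, H)
print("round-to-nearest:  ", rtn, "loss", round(layer_loss(w, rtn, H), 3))
print("with compensation: ", gptq, "loss", round(layer_loss(w, gptq, H), 3))
```

Because the two inputs are correlated, rounding the first weight down is best offset by rounding the second one up, which independent round-to-nearest cannot discover.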

AWQ (Lin et al., 2023). The activation-aware lightweight. AWQ’s central observation: not all weights are equally important to quantize well. About 1% of weights, the ones connected to the highest-magnitude activations, carry most of the information; the other 99% can be quantized aggressively. AWQ finds the salient weights by looking at the activation magnitudes (not the weight magnitudes; hence “activation-aware”) and applies a per-channel scaling that amplifies the salient weights before quantization, then divides the activations by the same factor at runtime to cancel out the amplification. Before rounding, the transformation is mathematically exact because the two scalings cancel, but the salient weights now occupy a wider slice of the quantization range, which preserves their precision. AWQ skips GPTQ’s slow Hessian computation and runs much faster, and its quality at INT4 is generally on par with or better than GPTQ. AWQ is a common production INT4 default in 2026, and many pre-quantized 4-bit checkpoints on Hugging Face are AWQ.
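A toy illustration of the scaling trick on a single weight row, assuming symmetric absmax INT4 rounding and a hand-picked scale of 2 on the salient channel. Real AWQ searches for the scales from calibration activations and folds the inverse scale into the preceding operation; the numbers below are invented for the example.

```python
def quantize_row(w, bits=4):
    """Symmetric absmax round-to-nearest for one weight row."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in w) / qmax
    return [scale * round(v / scale) for v in w]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


w = [0.70, 0.14, -0.42, 0.21]  # one output row of a linear layer
x = [1.0, 50.0, 1.0, 1.0]      # input channel 1 carries an activation outlier
s = [1.0, 2.0, 1.0, 1.0]       # hand-picked scale on the salient channel

w_scaled = [wi * si for wi, si in zip(w, s)]  # amplify the salient weight
x_scaled = [xi / si for xi, si in zip(x, s)]  # shrink the matching activation

exact = dot(w, x)
err_plain = abs(dot(quantize_row(w), x) - exact)
err_awq = abs(dot(quantize_row(w_scaled), x_scaled) - exact)
print("unquantized outputs match:", abs(dot(w_scaled, x_scaled) - exact) < 1e-9)
print(f"output error, plain INT4:  {err_plain:.2f}")
print(f"output error, scaled INT4: {err_awq:.2f}")
```

The quantization step is unchanged (the row maximum is the same), but the rounding error on the salient weight is now multiplied by a smaller activation, so the output error drops.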

SmoothQuant (Xiao et al., 2022). The W8A8 unlocker. The activation-outlier problem hits hard above 6.7B parameters: a small fraction of activation channels (often <1%) carry magnitudes 100× larger than the rest, and naive activation quantization either clips those channels (catastrophic quality loss) or sets the scale to cover them (catastrophic precision loss on the other 99%). SmoothQuant’s fix: migrate the quantization difficulty from activations to weights with an offline per-channel rescaling. Activations get smoother (easier to quantize), weights get spikier (still quantizable because they have headroom), and because the activation scaling and the weight scaling cancel, the model’s full-precision outputs are unchanged. The result is W8A8 quantization with quality close to W8A16, and on hardware with INT8 tensor cores throughput rises substantially (the paper reports up to 1.56× speedup and 2× memory reduction).
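The rescaling can be sketched with per-channel absolute maxima. The formula below is the smoothing factor from the SmoothQuant paper, s_j = max|X_j|^α / max|W_j|^(1-α), with migration strength α = 0.5; the calibration statistics are hypothetical numbers chosen for illustration.

```python
def smoothing_factors(act_absmax, weight_absmax, alpha=0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), per input channel j."""
    return [a**alpha / w ** (1 - alpha) for a, w in zip(act_absmax, weight_absmax)]


def spread(values):
    """Ratio between the largest and smallest per-channel range."""
    return max(values) / min(values)


# Per-input-channel statistics from a hypothetical calibration run.
act_absmax = [0.8, 64.0, 1.2, 0.5]  # channel 1 is an activation outlier
weight_absmax = [0.4, 0.25, 0.3, 0.5]

s = smoothing_factors(act_absmax, weight_absmax)
smooth_act = [a / sj for a, sj in zip(act_absmax, s)]     # X' = X / s
smooth_wgt = [w * sj for w, sj in zip(weight_absmax, s)]  # W' = W * s

print(f"activation spread: {spread(act_absmax):.0f}x -> {spread(smooth_act):.1f}x")
print(f"weight spread:     {spread(weight_absmax):.1f}x -> {spread(smooth_wgt):.1f}x")
```

With α = 0.5 the activation and weight ranges of each channel meet in the middle: the activation spread falls from 128× to 8×, while the weight spread grows from 2× to 8×.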

FP8. The hardware-native 8-bit float. Hopper and Blackwell GPUs include native FP8 tensor cores with 2× the peak matrix-multiplication rate of FP16. FP8 comes in two flavors: E4M3 (4 exponent bits, 3 mantissa bits; more precision, less range) and E5M2 (5 exponent bits, 2 mantissa bits; more range, less precision). Inference typically uses E4M3 for both weights and activations; E5M2’s extra range is mainly useful for gradients during FP8 training. The pitch: native hardware support means no custom dequantization kernels, near-FP16 quality out of the box, and large memory and throughput gains. TensorRT-LLM can roughly double throughput on H100 with FP8 at minimal accuracy cost, and on Blackwell the FP8 advantage compounds with the new FP4 path.
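The range vs precision trade-off between the two FP8 flavors follows directly from the bit layout. The sketch below derives the largest finite value and the relative spacing of each format, assuming the widely used encodings: E5M2 reserves the top exponent code for inf/NaN like IEEE 754, while E4M3 (the "fn" variant) reserves only a single NaN bit pattern.

```python
def fp8_max(exp_bits, man_bits, ieee_like):
    """Largest finite value of an 8-bit float with the given exponent/mantissa split."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # E5M2: the all-ones exponent code is reserved for inf/NaN.
        max_exp = (2**exp_bits - 2) - bias
        max_frac = 2 - 2**-man_bits
    else:
        # E4M3 "fn": only the all-ones mantissa at the top exponent is NaN.
        max_exp = (2**exp_bits - 1) - bias
        max_frac = 2 - 2 ** -(man_bits - 1)
    return max_frac * 2**max_exp


def relative_step(man_bits):
    """Spacing between neighboring values relative to their magnitude."""
    return 2**-man_bits


print("E4M3 max:", fp8_max(4, 3, ieee_like=False), "relative step:", relative_step(3))
print("E5M2 max:", fp8_max(5, 2, ieee_like=True), "relative step:", relative_step(2))
```

E4M3 tops out at 448 with a relative step of 1/8, while E5M2 reaches 57344 with a coarser relative step of 1/4.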

A summary table of the 2026 lineup:

Format	Bits	Weight/Act	Calibration	Quality loss vs FP16 (approx.)	Hardware
FP16/BF16	16	W16A16	none	baseline	universal
FP8 (E4M3/E5M2)	8	W8A8	per-tensor or per-channel	within ~0.1 pp	Hopper, Blackwell
INT8 (SmoothQuant)	8	W8A8	per-channel + smoothing	within ~0.2 pp	broad
INT8 (weight-only)	8	W8A16	per-channel	within ~0.1 pp	universal
INT4 (AWQ)	4	W4A16	activation-aware per-group	within ~0.5–1 pp	broad
INT4 (GPTQ)	4	W4A16	Hessian-based per-group	within ~0.5–1 pp	broad
NF4 (QLoRA-style)	4	W4A16	none (datatype is the trick)	within ~0.5–1 pp	broad
FP4 (Blackwell)	4	W4A4	per-tensor or per-channel	within ~1 pp on early benchmarks	Blackwell only

FP8 suits Hopper or Blackwell deployments that prioritize throughput and quality. AWQ INT4 suits memory-constrained or smaller-hardware deployments. SmoothQuant W8A8 targets INT8 throughput on broader hardware.

Distillation mechanics

The soft-target loss. The original distillation loss from Hinton et al. is a weighted combination of two terms: the standard hard-label cross-entropy and the soft-target KL divergence between the teacher and student distributions. The trick that makes the soft distribution useful is temperature scaling: divide the logits by T > 1 before the softmax. Higher T flattens the distribution, exposing the smaller probabilities the teacher assigns to the wrong answers. At T=1 the teacher’s confident answer drowns out everything else; at T=4 or T=10, the relative ranking of the wrong answers becomes visible and trainable. The loss is α · CE(student_T=1, hard_label) + (1-α) · T² · KL(teacher_T || student_T), with T typically in [2, 10] and α typically around 0.5. The KL runs from teacher to student (forward KL), which equals the cross-entropy against the teacher’s soft targets up to a constant. The T² factor exists because the gradients of the softened term scale as 1/T², and rescaling restores them to roughly the same magnitude as the hard-label gradient.
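A plain-Python sketch of this loss for a single example, using the forward direction KL(teacher || student), which is what minimizing cross-entropy against the teacher's soft targets corresponds to. The logits are invented three-class values (cat, dog, airplane); a training loop would compute the same quantity on batched tensors in a framework such as PyTorch.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax, shifted by the max logit for stability."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """alpha * CE(student at T=1, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    ce = -math.log(softmax(student_logits)[hard_label])
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    return alpha * ce + (1 - alpha) * T**2 * kl


teacher = [6.0, 4.5, 0.5]  # cat, dog, airplane
student = [3.0, 1.0, 2.0]
print("teacher at T=1:", [round(p, 3) for p in softmax(teacher, T=1.0)])
print("teacher at T=4:", [round(p, 3) for p in softmax(teacher, T=4.0)])
print("loss:", round(distillation_loss(student, teacher, hard_label=0), 4))
```

Raising T lifts the probability the teacher assigns to the unlikely classes, which is the signal the soft-target term trains on.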

Distillation regimes. Logit-level distillation (the Hinton original): the student learns to match the teacher’s full output distribution. It needs access to teacher logits, which is easy with open-weight teachers and impossible with closed-API teachers that only return text. Sequence-level distillation (the workhorse): generate text from the teacher, then fine-tune the student on the (input, generated-output) pairs as standard SFT. It doesn’t need logits and works with any teacher you can query, but it throws away most of the per-token distribution information. On-policy distillation (MiniLLM, Gu et al., 2023): the student generates a response, the teacher scores it, and the student updates via policy gradient with a reverse-KL objective. This gives better calibration on long generations than forward-KL distillation, but it needs teacher logits and is more expensive than offline SFT. The 2026 production default is sequence-level (closed-API teachers are too useful to give up), with on-policy used for capability transfer when teacher logits are available.

Dataset construction. The single most important variable. The distillation corpus needs to cover the student’s deployment distribution; if the corpus is off-distribution, the student will be off-distribution. The standard recipe: capture a representative sample of real user traffic, run it through the teacher, and use the (input, teacher-output) pairs as training data. For reasoning tasks, capture the full chain-of-thought from the teacher. That is what DeepSeek-R1 did with its roughly 800k curated samples, most of them reasoning trajectories, and the result was small students that retained a substantial share of the teacher’s reasoning quality. The corpus size matters less than the corpus distribution: 50k well-curated examples often beat 500k randomly sampled ones.
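The capture-and-label recipe can be sketched as a small function. Everything here is hypothetical scaffolding: `teacher` stands for whatever callable wraps the real teacher model, `fake_teacher` is a stand-in so the example runs offline, and the prompt/completion record layout is one common SFT convention rather than a required format.

```python
import json
import random


def build_distillation_corpus(prompts, teacher, sample_size, seed=0):
    """Deduplicate prompts, sample a subset, and label each with the teacher's output."""
    rng = random.Random(seed)
    unique = list(dict.fromkeys(p.strip() for p in prompts if p.strip()))
    sampled = rng.sample(unique, min(sample_size, len(unique)))
    return [{"prompt": p, "completion": teacher(p)} for p in sampled]


def fake_teacher(prompt):
    # Placeholder: a real pipeline would call the teacher model here and keep
    # its full output, including any chain-of-thought.
    return f"[teacher output for] {prompt}"


traffic = ["What is 2+2?", "Summarize this log", "What is 2+2?", "  ", "Write a SQL join"]
records = build_distillation_corpus(traffic, fake_teacher, sample_size=3)
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

Deduplicating before sampling keeps frequent prompts from dominating the corpus; whether that is desirable depends on how closely the corpus should mirror raw traffic frequencies.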

Production examples. DistilBERT (Sanh et al., 2019) is the classic: 40% smaller than BERT, 60% faster, and retaining 97% of BERT’s language-understanding performance on GLUE. The recipe (smaller architecture, distillation from the full BERT teacher) became the template the field has been iterating on since. The 2024-2026 evolution: distill smaller students from much larger teachers, on much larger corpora, with much more careful corpus construction. The DeepSeek-R1-Distill family (1.5B through 70B students distilled from the R1 teacher) is the current canonical demonstration that distillation can transfer reasoning capabilities, not just pattern-matching, when the corpus is rich enough. OpenAI’s GPT-4o mini is widely understood to be a distillation of GPT-4o, though the recipe is undisclosed.

Post-training quantization in Python

The 2026 idiomatic way to quantize a Hugging Face model for vLLM serving is llm-compressor, the official toolkit maintained by the vLLM team. It supports INT4 (GPTQ-style), INT8 (weight-only and W8A8), and FP8 with a consistent API. Install (the quotes keep the shell from treating > as a redirect): pip install "llmcompressor>=0.4" "transformers>=4.46" "datasets>=3".

python

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier, QuantizationModifier

MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
OUT_INT4 = "./Llama-3.1-8B-INT4"
OUT_FP8 = "./Llama-3.1-8B-FP8"

# Load the base model and tokenizer. torch_dtype="auto" keeps the checkpoint
# dtype (BF16 for Llama 3.1); oneshot then compresses the model in place.
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

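# Calibration set: a small slice of representative inputs. 512 samples is a
# common starting point: enough to estimate per-channel statistics, small
# enough to run on a single GPU in minutes.
calib = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
calib = calib.map(lambda row: {
    "text": tokenizer.apply_chat_template(row["messages"], tokenize=False)
})
# Tokenize and pass the Dataset itself (as in the llm-compressor examples)
# rather than a plain list of strings. The name calib_texts is kept so the
# oneshot calls below stay unchanged.
calib_texts = calib.map(
    lambda row: tokenizer(row["text"], max_length=2048, truncation=True,
                          add_special_tokens=False),
    remove_columns=calib.column_names,
)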

# --------------------------------------------------------------------------
# INT4 weight-only with GPTQ-style Hessian-based quantization.
# This is the workhorse for memory-constrained serving: Llama-3.1-8B weights
# drop from about 16 GB in 16-bit to roughly 5-6 GB (embeddings and lm_head
# stay in 16-bit), which leaves room for the KV cache on a 16 GB GPU.
# --------------------------------------------------------------------------
recipe_int4 = GPTQModifier(
    targets="Linear",
    scheme="W4A16",            # 4-bit weights, 16-bit activations
    ignore=["lm_head"],         # skip the output projection
    dampening_frac=0.1,         # Hessian regularization
)
oneshot(
    model=model,
    dataset=calib_texts,
    recipe=recipe_int4,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=OUT_INT4,
)

# --------------------------------------------------------------------------
# FP8 weight-and-activation quantization for Hopper/Blackwell deployment.
# Native hardware support: the FP8 tensor cores on H100/B200 offer up to about
# 2x the peak matmul throughput of FP16, usually with minimal quality loss.
# --------------------------------------------------------------------------
# Reload fresh model since the previous call mutated weights in-place.
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

recipe_fp8 = QuantizationModifier(
    targets="Linear",
    scheme="FP8",               # E4M3 for both weights and activations
    ignore=["lm_head"],
)
oneshot(
    model=model,
    dataset=calib_texts,
    recipe=recipe_fp8,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=OUT_FP8,
)

# Resulting checkpoints can be served directly by vLLM:
#   vllm serve ./Llama-3.1-8B-INT4 --quantization compressed-tensors
#   vllm serve ./Llama-3.1-8B-FP8  --quantization compressed-tensors

Calibration data should match production traffic; code workloads need code samples, while chat workloads need conversational samples. The output projection often remains unquantized because errors there affect the token distribution across the vocabulary. INT4 and FP8 artifacts can share calibration data while serving different hardware and memory constraints.