jatin.blog ~ $
$ cat ai-engineering/lora-peft.md

LoRA and Parameter-Efficient Fine-Tuning

LoRA, QLoRA, DoRA, and the PEFT stack in 2026: the math, the production defaults (rank, alpha, target modules), and the multi-tenant serving pattern.

Jatin Bansal@blog:~/ai-engineering$ open lora-peft

A team running a fine-tuning experiment in 2021 needed eight A100s and a few days to fine-tune a 7B model. The same workload in 2026 runs on a single consumer GPU in an afternoon — and produces an artifact that’s two to four megabytes on disk instead of fourteen gigabytes. The technique that closed that gap is Low-Rank Adaptation, and the reason it matters is not just the unit economics of training; it’s that LoRA adapters are small enough to deploy thousands of fine-tuned models on top of a single base model loaded once in GPU memory. The “fine-tune your own model” pattern that used to require provisioning a dedicated cluster per tenant turned into “ship a 4 MB adapter alongside the base model checkpoint, swap it in milliseconds at request time.” That shift is what made fine-tuning a default tool in the production stack rather than a research luxury.

Opening bridge

Yesterday’s piece on DPO walked through the loss function that has replaced PPO-style RLHF as the production default for alignment, and slipped in references to LoRA and peft_config without unpacking them. Today’s piece zooms in on that scaffolding. Every modern post-training loop — SFT, DPO, ORPO, RLAIF — runs on top of PEFT, because full-parameter fine-tuning of a 70B model is no longer a serious production option for anyone outside the frontier labs. The reason the DPO scaffold from yesterday trained on a single GPU at all is that LoRA adapters cut the trainable parameter count by 100×, the optimizer state by the same factor, and the activation memory enough that the reference model can co-resident with the policy. The economic argument for fine-tuning at all — the one Fine-Tuning vs RAG leans on when the decision tree lands on “fine-tune” — assumes PEFT throughout. Without PEFT, the cost math for fine-tuning collapses and RAG wins by default.

Definition

Parameter-Efficient Fine-Tuning (PEFT) is the family of techniques that adapt a pre-trained model to a downstream task by training a small number of new or modified parameters while leaving the original weights frozen. The canonical instance is LoRA (Low-Rank Adaptation), which approximates the weight update for each adapted layer as the product of two low-rank matrices B @ A, where A and B together have orders of magnitude fewer parameters than the layer’s full weight matrix. For a transformer layer with a weight matrix W ∈ R^(d×k), LoRA proposes W + ΔW = W + BA where B ∈ R^(d×r), A ∈ R^(r×k), and r ≪ min(d, k). The full base model W stays frozen; only A and B are trained. At inference time you can either compute the two products separately and add them (the cost is two extra small matmuls per layer) or merge BA back into W once and serve the resulting model with no additional overhead.

The reason LoRA matters beyond the parameter-count win is the empirical observation from the original LoRA paper (Hu et al., 2021): the weight update ΔW that a full fine-tune installs is itself approximately low-rank. Fine-tuning doesn’t need to rewrite every dimension of the weight matrix; it needs to nudge a handful of directions. So restricting the update to be exactly low-rank from the start doesn’t cost you much, because the unrestricted update was nearly low-rank anyway. This is the entire pitch: the constraint is matched to the structure of the problem.

Intuition

Imagine you’ve been told to write small edits on top of an existing book. You could photocopy every page and mark it up — the equivalent of full fine-tuning, where every parameter is touched. Or you could attach a single sticky note to each chapter that says “in this chapter, do X differently.” LoRA is the sticky-note version. The sticky notes are far smaller than the book, you can swap which set of stickies you’ve attached on demand, and as long as the stickies capture the actual editorial intent, the result reads about the same as the marked-up photocopy.

The deeper way to see it: a transformer layer’s weight matrix is doing a transformation that has way more capacity than any single downstream task needs. A 4096×4096 attention projection has 16M parameters. The behavioral change a fine-tune installs — “answer in JSON,” “speak in our brand voice,” “refuse these specific request categories” — is encoded by a much smaller change in the function the layer computes. Linear-algebra intuition: most weight matrices, after pretraining, are already close to whatever fine-tuned target you’d want; the difference between them lives in a low-dimensional subspace. LoRA gives you a parameterization that lives exactly in that subspace.

The distributed-systems parallel

LoRA is a copy-on-write overlay for model weights. Filesystem snapshots, container image layers, Git object storage, all the way down to copy-on-write semantics in fork() — the recurring trick is don’t modify the base, write the diff somewhere cheap and compose the two on read. Docker image layers are the closest fit: the base image is multi-gigabyte, read-only, shared across hundreds of containers; each container layer is small, mutable, mounted on top, and composed at runtime via overlayfs. The substitution into LoRA: the base model is the multi-gigabyte read-only image, the adapter is the per-tenant overlay, and “running a container” is “serving inference for that tenant.” The economics of Docker — one base image, hundreds of layered containers per host — is the same economics that makes multi-LoRA serving on vLLM viable. You load the 70B base once, hold adapters in cheap memory (CPU RAM, NVMe), page them onto the GPU per request, throw them out, repeat.

Adapter merging is checkpoint vs delta. Database design has a long-running tension between “store the current state” (checkpoint) and “store the sequence of changes” (delta / write-ahead log). The trade-off is the same: deltas are cheap to write and compose, but reading them out requires replaying or summing; checkpoints are expensive to materialize but cheap to read. PEFT’s merge_and_unload() is exactly the WAL-checkpoint operation: the adapter is the delta, the base model is the prior checkpoint, and the merge materializes the new checkpoint by folding the delta in. Once merged, you’ve lost the ability to swap adapters cheaply (the delta is gone), but inference latency drops because there’s no per-layer extra matmul. The trade-off is single-tenant latency vs multi-tenant flexibility, and the right answer depends on which axis you’re optimizing.

Mechanics: the rank-r decomposition

The math of LoRA is one equation, three hyperparameters, and one initialization trick. Worth being precise about each.

The equation. For each adapted layer, the modified forward pass becomes:

text
1
h = Wx + (α/r) · BAx

where W is the frozen pretrained weight, B ∈ R^(d×r) and A ∈ R^(r×k) are the trainable matrices, α is the LoRA alpha scaling factor, and r is the rank. At initialization, A is sampled from a small Gaussian and B is initialized to zero — so the initial BA product is the zero matrix, meaning the LoRA-augmented model is exactly the base model at step 0. This is critical: the optimizer starts from the base model, not from a randomly perturbed version, and any drift is purely the result of training.

The rank r. This is the bottleneck dimension — the number of “directions” in weight space the adapter can move along. The cost is linear in r: doubling the rank doubles the parameter count and roughly doubles the memory and compute cost of the LoRA forward/backward. Higher rank gives the adapter more capacity to fit complex behavior changes; lower rank acts as an implicit regularizer that prevents overfitting on small datasets. The production consensus in 2026: r=16 for stylistic adjustments, r=32 for general SFT, r=64 for complex multi-turn behaviors or coding tasks. Going above 128 is usually a sign you should be running full fine-tuning or switching to a stronger base model.

The alpha α. This is the scaling factor applied to the LoRA delta. The actual update is (α/r) · BA, not just BA — so increasing α at fixed r amplifies the adapter’s contribution to the forward pass. The convention is to set α = r (giving a scaling factor of 1.0), or α = 2r (giving a factor of 2.0) as a stronger default. The Unsloth team’s 2026 documentation recommends starting with α = r = 16 for stability; if the loss curve is sluggish, bump α (not r) to amplify the existing capacity rather than adding new parameters.

The target modules. The single hyperparameter that most affects final quality. The original LoRA paper applied adapters only to q_proj and v_proj (the attention query and value projections), but the QLoRA paper (Dettmers et al., 2023) showed that applying LoRA to all linear layers — q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj — consistently outperforms attention-only LoRA for the same parameter budget. The MLP layers (gate_proj, up_proj, down_proj) are where most factual recall and stylistic behavior is encoded; leaving them frozen means the adapter can’t reach the parts of the network that hold the most fine-tunable behavior. The 2026 default in PEFT 0.13+ is target_modules="all-linear", which picks up every linear layer in the model automatically. For most workloads this is the right starting point.

The math is one paragraph; the rest of LoRA’s complexity lives in the hyperparameter interactions and the variants that fix specific failure modes.

QLoRA: making it run on a consumer GPU

QLoRA (Dettmers et al., 2023) is the engineering achievement that put 65B-parameter fine-tuning onto a single 48GB GPU. The pitch: instead of holding the frozen base model in 16-bit precision, hold it in 4-bit, then attach standard 16-bit LoRA adapters on top. The 4-bit base is read-only — gradients flow through it via dequantization on the fly, but the quantized weights themselves never need to be updated, so the imprecision of 4-bit doesn’t show up where it would hurt. The trainable LoRA adapters stay in 16-bit, where the optimizer needs the dynamic range to make small updates accurately. Three engineering tricks make this work:

NF4 (NormalFloat 4-bit). A custom 4-bit data type designed to be information-theoretically near-optimal for normally distributed weights. Standard 4-bit integer quantization spreads its 16 representable values uniformly across the range, but weight distributions are bell-shaped, so most of the precision is wasted on tails that contain few values. NF4 packs the 16 representable values quantile-spaced over a normal distribution, so each value covers roughly the same number of actual weights. Empirically, NF4 matches BF16 performance on downstream tasks; FP4 (the standard 4-bit float) is about 1% behind.

Double quantization. Standard 4-bit quantization stores the quantization constants (the scale factors that map quantized integers back to floats) in FP32 — and at 64 weights per scale factor, those scale factors add up to about 0.5 bits per weight of overhead. Double quantization quantizes the quantization constants themselves down to 8-bit, saving another 0.4 bits per weight at minimal quality cost. The names are confusing but the trick is real: it’s worth roughly 1 GB of memory for a 33B model.

Paged optimizers. The optimizer state for AdamW is 2× the size of the parameters being optimized — for a 7B-parameter LoRA fine-tune, that’s manageable, but when the optimizer state is unpredictable (e.g. during a memory spike when a long sequence comes through), the GPU will OOM. QLoRA uses NVIDIA’s unified memory to page optimizer state to CPU RAM when GPU memory pressure is high. The performance cost is real (CPU paging is slow) but it’s the difference between a workload running and a workload OOMing, which is the right trade-off in practice.

The end result: a 7B model fine-tunes in about 6 hours on a single A100 40GB at a cost of around $12 for an Llama 3 8B equivalent run on a public cloud GPU in 2026. A 70B model fine-tunes on a single A100 80GB or H100. The cost ratio versus full fine-tuning is roughly 50–100×, with most production benchmarks showing QLoRA within 1–2 percentage points of full fine-tuning on AlpacaEval or domain-specific evals.

The variant zoo

Like DPO, LoRA spawned a family of follow-ups. The four worth knowing in 2026:

DoRA (Weight-Decomposed Low-Rank Adaptation, Liu et al., ICML 2024 Oral). The most empirically robust LoRA successor. DoRA decomposes the pretrained weight matrix into a magnitude vector and a direction matrix, then uses LoRA to update only the direction. Pretrained weights have meaningful magnitudes that LoRA’s symmetric BA update tends to perturb in ways that hurt training stability; DoRA preserves those magnitudes by separating them out. The empirical payoff is consistent: DoRA outperforms LoRA at the same parameter budget, often by 2–5 points on commonsense reasoning and instruction-following benchmarks. The cost is roughly 10–20% slower training due to the extra normalization step. In PEFT 0.10+, enable with use_dora=True in LoraConfig. Use it when you have a LoRA pipeline that works but is leaving quality on the table; the migration is one config flag.

LoRA+ (Hayou et al., ICML 2024). Notes that the standard LoRA setup uses the same learning rate for matrices A and B, but the scaling analysis for wide networks shows this is suboptimal — matrix B should be trained with a higher learning rate than A, typically 8–16×. The fix is one line: pass a learning-rate multiplier for B separately. The empirical gain is 1–2% on benchmarks plus a 2× speedup in convergence. PEFT supports this via the loraplus_lr_ratio parameter. Use it when LoRA training is taking longer than you’d expect; it’s a free improvement on training efficiency.

rsLoRA (Rank-Scaled LoRA, Kalajdzievski 2023). Replaces the α/r scaling factor with α/sqrt(r). The motivation: as you scale r up, the standard α/r scaling shrinks the effective LoRA contribution faster than the parameter count grows, so larger ranks become harder to train effectively. The square-root rescaling fixes this and lets very high ranks (r=128, 256) actually pay off when the task demands them. In PEFT, set use_rslora=True. Use it when you’ve identified that you actually need high rank — coding, complex multi-step reasoning — and standard LoRA isn’t converging well above r=64.

PiSSA (Principal Singular Values and Vectors Adaptation, Meng et al., 2024). Instead of initializing A from a small Gaussian and B from zero, PiSSA initializes both from the principal components of the base weight matrix via SVD. The base weight is then “residualized” — what’s left after subtracting the top-r SVD components — and the LoRA adapter starts already aligned with the directions of largest variance. Empirically this converges 2–3× faster than vanilla LoRA initialization and often reaches a better final loss. The trade-off: PiSSA initialization requires an SVD pass over the full base weights, which is one-time but expensive for a 70B model. Set init_lora_weights="pissa" in PEFT.

A summary table:

MethodInit / MechanismPrimary fix vs LoRAPEFT flag
LoRAA from Gaussian, B from zero— (baseline)(default)
QLoRALoRA on top of 4-bit-quantized baseMemory cost ~10× lowerbnb_4bit_quant_type="nf4"
DoRADecompose into magnitude + directionHigher final qualityuse_dora=True
LoRA+Higher LR on B than on AFaster convergenceloraplus_lr_ratio
rsLoRAα/sqrt(r) scalingHigher rank trains betteruse_rslora=True
PiSSAInit from base SVD top componentsFaster convergenceinit_lora_weights="pissa"

The honest 2026 pattern: start with QLoRA + standard LoRA at r=32, α=64, target_modules="all-linear", then enable use_dora=True as the cheap upgrade if quality is the bottleneck, or use_rslora=True if you need higher rank. The other variants are situational.

Code: a full QLoRA fine-tune in TRL + PEFT + bitsandbytes

The pipeline below trains a 7B model on a single 24GB GPU. Setup: pip install "trl>=0.22" "transformers>=4.46" "datasets>=3" "peft>=0.13" "bitsandbytes>=0.43" accelerate.

python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from trl import SFTConfig, SFTTrainer

BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"
OUT_DIR = "./qlora-qwen-7b"

# 4-bit NF4 quantization: the frozen base sits in 4-bit; LoRA adapters stay in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # information-theoretically near-optimal
    bnb_4bit_use_double_quant=True,     # ~0.4 bits/weight saved
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA config: the hyperparameter triangle.
peft_config = LoraConfig(
    r=32,                               # rank: 32 is the general-SFT default
    lora_alpha=64,                      # alpha = 2r is a common stronger default
    target_modules="all-linear",        # 2026 default; picks every linear layer
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=True,                      # DoRA: cheap quality upgrade
    # init_lora_weights="pissa",        # PiSSA: faster convergence; expensive init
    # use_rslora=True,                  # rsLoRA: enable if going to r>64
)

# Wrap the base model with the LoRA adapter.
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# >>> trainable params: 21,143,552 || all params: 7,636,361,856 || trainable%: 0.27

# Public instruction-tuning dataset; small subset for demonstration.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")

config = SFTConfig(
    output_dir=OUT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size 16
    learning_rate=2e-4,                 # LoRA LR is ~10x higher than full FT
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_seq_length=2048,
    bf16=True,
    logging_steps=20,
    save_strategy="epoch",
    optim="paged_adamw_8bit",           # QLoRA's paged optimizer
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=ds,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model(OUT_DIR)             # saves only the ~84 MB adapter, not 14 GB base

Three things worth pulling out of this scaffold.

First, the optim="paged_adamw_8bit" is the paged optimizer from QLoRA. It pages optimizer state to CPU RAM under memory pressure; without it, long sequences in your batch can OOM the GPU during the AdamW second-moment update. Use it whenever you’re at the memory ceiling on a single GPU.

Second, the saved checkpoint at OUT_DIR is the adapter only — typically 50–200 MB depending on rank and model size. The base model is not duplicated; you reload it at inference time from the original checkpoint and attach the adapter. This is what makes multi-tenant serving viable: 100 adapters at 100 MB each is 10 GB of disk, versus 100× the base model size if you’d done full fine-tuning.

Third, the learning rate (2e-4) is roughly 10× higher than full fine-tuning’s 2e-5. The LoRA adapter starts from zero contribution, so the optimizer needs aggressive updates to install the behavior; a too-low LR will just stall.

Code: a managed LoRA fine-tune in TypeScript

Most production teams don’t want to operate GPUs for fine-tuning. The managed APIs from Together AI, OpenAI, and Fireworks all accept the same (prompt, response) JSONL shape and run LoRA fine-tuning on their infrastructure, returning a deployable model ID.

typescript
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
// npm install together-ai
import Together from "together-ai";
import * as fs from "node:fs";

const together = new Together({ apiKey: process.env.TOGETHER_API_KEY });

// 1. Upload a JSONL file of {messages: [{role, content}, ...]} rows.
const file = await together.files.upload(
  fs.createReadStream("./sft.jsonl"),
  { purpose: "fine-tune" }
);

// 2. Submit a LoRA fine-tune. The cost ratio vs full fine-tuning is
//    typically 5-10x lower, and the resulting adapter can be deployed
//    on the shared Together base model rather than a dedicated endpoint.
const job = await together.fineTuning.create({
  training_file: file.id,
  model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
  n_epochs: 3,
  learning_rate: 2e-4,
  lora: true,                    // LoRA fine-tune, not full
  lora_r: 32,                    // rank
  lora_alpha: 64,                // typically 2r
  lora_dropout: 0.05,
  lora_trainable_modules: "all-linear",
  // For 70B+ models, request QLoRA mode automatically:
  // quantization: "nf4",
});

console.log("submitted LoRA job", job.id);
// When job.status === "completed", job.output_name is a deployable model ID
// that can be called via the standard chat completions API.

The trade-off between the two paths follows the same shape as the DPO trade-off: the managed API is the right starting point when the defaults work; you graduate to TRL+PEFT when you need control over the loss, the data preprocessing, the adapter merging strategy, or non-standard variants like DoRA/PiSSA that the managed APIs may not expose.

Production deployment: merge vs serve unmerged

After training, the adapter and base can either be merged into a single checkpoint or served as two artifacts composed at runtime. The decision is a single-axis trade-off and the right answer depends entirely on whether you’re serving one fine-tuned model or many.

Single-tenant: merge. If exactly one adapter will ever be served against this base model, merge_and_unload() folds the adapter into the base weights and returns a standard transformers checkpoint. Inference latency drops by roughly 5–15% (no per-layer LoRA matmul) and deployment simplifies (no PEFT dependency at inference time). The cost is permanent: once merged, you can’t swap adapters or unmerge cheaply.

python
1
2
3
4
5
6
7
8
9
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="bfloat16")
peft_model = PeftModel.from_pretrained(base, "./qlora-qwen-7b")

# Merge LoRA into the base; returns a standard transformers model.
merged = peft_model.merge_and_unload()
merged.save_pretrained("./qwen-7b-merged")

Multi-tenant: serve unmerged with hot-swap. If you’re serving N adapters against the same base model — one per customer, one per use case, one per A/B test variant — keep them unmerged. vLLM 0.7+ supports native dynamic LoRA loading: the base model lives permanently in GPU memory, adapters sit in CPU RAM (or NVMe), and the runtime pages an adapter onto the GPU per request — a transfer that takes single-digit milliseconds because the payload is tiny. The --enable-lora and --max-loras flags configure how many adapters can be resident on the GPU at once; PCIe transfer overhead is amortized across the request’s prefill and decode passes.

bash
1
2
3
4
5
6
7
8
9
# Launch vLLM with multi-LoRA serving
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --lora-modules \
    tenant-a=./adapters/tenant-a \
    tenant-b=./adapters/tenant-b \
    tenant-c=./adapters/tenant-c

At inference time, the client requests the adapter by name in the model field of the API call, and vLLM routes the request through the correct LoRA. The cost economics flips here: instead of provisioning one GPU per fine-tuned model, you provision one GPU and host dozens to thousands of tenants on it. The break-even point is around 3–4 adapters; below that, single-adapter merged serving is faster, but above it, multi-LoRA dominates by an order of magnitude on dollar cost per request.

The S-LoRA paper (Sheng et al., 2023) is the foundational work on this serving pattern — they showed that with batched LoRA computation and adapter caching policies, throughput stays close to base-model throughput even with thousands of adapters resident on a single GPU. vLLM’s implementation is the production-grade descendant of that work.

Trade-offs, failure modes, gotchas

Capacity ceiling. LoRA’s compression is not free — at extreme parameter cuts (very low rank, attention-only target modules) the adapter literally cannot represent some target functions. The symptom: training loss drops, eval loss plateaus, and the gap to full fine-tuning never closes regardless of how long you train. The fix: increase rank, add MLP target modules, or accept that this is a workload where full fine-tuning is the right tool. The LoRA capacity question is one of the most-studied LoRA failure modes; the rule of thumb is that LoRA captures 95–99% of full fine-tuning quality on stylistic and behavioral tasks but loses ground on tasks that require substantial new knowledge or significant restructuring of the base model’s reasoning. The fine-tuning vs RAG decision is partly a question of whether LoRA’s capacity ceiling is hit.

Target-module misses. Forgetting to include MLP modules in target_modules is the most common LoRA mistake. The original LoRA paper used ["q_proj", "v_proj"] as the example, and a lot of tutorial code copied that without updating. Attention-only LoRA leaves a lot of behavioral capacity on the table; for any task harder than style transfer, set target_modules="all-linear" and move on.

The α scaling trap. A common confusion: people increase α expecting “more capacity,” but α at fixed r is just a scalar multiplier on the existing capacity. The capacity is set by r. If your model is undertrained, raising α will speed convergence; if your model is undertrained because the adapter is too small, raising α won’t help — only raising r will. The diagnostic: if training loss is dropping but eval loss has plateaued early, you’re capacity-bound on r; if both losses are still dropping but slowly, you might gain from a higher α (or a higher learning rate, which is mathematically equivalent at low LR).

Merge-time numerical drift. Merging the adapter into the base introduces a small numerical change because the merge happens in lower precision than the original fine-tuning. For BF16 base models with BF16 LoRA, the drift is negligible; for QLoRA’s NF4 base, merging is more delicate — the standard merge_and_unload() will dequantize the base to FP16 first, then merge, then quantize back (or save in FP16). The merged checkpoint will be slightly behind the unmerged QLoRA model on eval, typically by 0.1–0.5 points. The community workaround is to save the QLoRA adapter unmerged and either serve it that way (multi-LoRA) or merge it onto a non-quantized base for single-tenant deployment.

Adapter sprawl. Once you can fine-tune cheaply, you fine-tune a lot. Production teams that don’t impose discipline on adapter creation end up with hundreds of adapters whose provenance, training data, and intended use are lost. The same observability discipline that applies to LLM systems generally applies harder to adapter sprawl — every deployed adapter needs a registry entry with its training data hash, hyperparameters, eval results, and owner. Otherwise the multi-LoRA serving pattern that saves you money in compute will cost you double in confusion.

Multi-LoRA throughput vs adapter density. Hosting K adapters on a single GPU sounds free, but each adapter request adds a small computational tax (the per-layer LoRA matmul). At low adapter density (K=2-4) the throughput hit vs a single merged model is single-digit percentage points; at K=100 it can be 30–50% depending on adapter rank. The implication: very-high-density multi-LoRA serving is the right pattern for long-tail tenants (each one calls the API rarely), not for high-traffic tenants (where the per-request tax compounds). Production teams typically run a mix: a few merged-base hot tenants on dedicated GPUs, and a multi-LoRA pool for the long tail.

Distribution drift between training and serving precisions. A QLoRA model trained with NF4 quantization will behave slightly differently when served as FP16 (post-merge) versus 4-bit (unmerged). The base model is the same checkpoint, but its computational behavior under different precisions is not identical. Production teams typically catch this by running their eval suite against the deployed configuration (not just the training-time configuration); a 0.5-point drop on the eval suite between training and deployed serving is a known pattern, not a bug. The broader set of techniques — INT4/INT8/FP8 weight quantization, AWQ vs GPTQ, the SmoothQuant activation-outlier fix — sits in the quantization and distillation article; this section is the QLoRA-specific slice of that surface.

Further reading from the field