$ cat ai-engineering/lora-peft.md

LoRA and Parameter-Efficient Fine-Tuning

How LoRA and QLoRA reduce trainable parameters, GPU memory, and adapter deployment cost.

Jatin Bansal@blog:~/ai-engineering$ open lora-peft

Low-Rank Adaptation trains a small adapter while keeping the base model frozen. The resulting checkpoint can be a few megabytes rather than a copy of the full model, which lowers training memory and allows several adapters to share one loaded base model.

Low-rank weight updates

Parameter-Efficient Fine-Tuning adapts a model by training a small set of parameters while the original weights remain frozen. LoRA represents each adapted layer’s update as BA, with B ∈ R^(d×r), A ∈ R^(r×k), and r ≪ min(d, k). At inference time the adapter can remain separate or be merged into W.

The reason LoRA matters beyond the parameter-count win is the empirical observation from the original LoRA paper (Hu et al., 2021): the weight update ΔW that a full fine-tune installs is itself approximately low-rank. Fine-tuning doesn’t need to rewrite every dimension of the weight matrix; it needs to nudge a handful of directions. So restricting the update to be exactly low-rank from the start doesn’t cost you much, because the unrestricted update was nearly low-rank anyway. This is the entire pitch: the constraint is matched to the structure of the problem.

Tune rank, scale, and target modules

The math of LoRA is one equation, three hyperparameters, and one initialization trick. Worth being precise about each.

The equation. For each adapted layer, the modified forward pass becomes:

text

1
h = Wx + (α/r) · BAx

where W is the frozen pretrained weight, B ∈ R^(d×r) and A ∈ R^(r×k) are the trainable matrices, α is the LoRA alpha scaling factor, and r is the rank. At initialization, A is sampled from a small Gaussian and B is initialized to zero; so the initial BA product is the zero matrix, meaning the LoRA-augmented model is exactly the base model at step 0. This is critical: the optimizer starts from the base model, not from a randomly perturbed version, and any drift is purely the result of training.

The rank r. This is the bottleneck dimension; the number of “directions” in weight space the adapter can move along. The cost is linear in r: doubling the rank doubles the parameter count and roughly doubles the memory and compute cost of the LoRA forward/backward. Higher rank gives the adapter more capacity to fit complex behavior changes; lower rank acts as an implicit regularizer that prevents overfitting on small datasets. The production consensus in 2026: r=16 for stylistic adjustments, r=32 for general SFT, r=64 for complex multi-turn behaviors or coding tasks. Going above 128 is usually a sign you should be running full fine-tuning or switching to a stronger base model.

The alpha α. This is the scaling factor applied to the LoRA delta. The actual update is (α/r) · BA, not just BA; so increasing α at fixed r amplifies the adapter’s contribution to the forward pass. The convention is to set α = r (giving a scaling factor of 1.0), or α = 2r (giving a factor of 2.0) as a stronger default. The Unsloth team’s 2026 documentation recommends starting with α = r = 16 for stability; if the loss curve is sluggish, bump α (not r) to amplify the existing capacity rather than adding new parameters.

The target modules. The single hyperparameter that most affects final quality. The original LoRA paper applied adapters only to q_proj and v_proj (the attention query and value projections), but the QLoRA paper (Dettmers et al., 2023) showed that applying LoRA to all linear layers; q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj; consistently outperforms attention-only LoRA for the same parameter budget. The MLP layers (gate_proj, up_proj, down_proj) are where most factual recall and stylistic behavior is encoded; leaving them frozen means the adapter can’t reach the parts of the network that hold the most fine-tunable behavior. The 2026 default in PEFT 0.13+ is target_modules="all-linear", which picks up every linear layer in the model automatically. For most workloads this is the right starting point.

The math is one paragraph; the rest of LoRA’s complexity lives in the hyperparameter interactions and the variants that fix specific failure modes.

QLoRA reduces base-model memory

QLoRA (Dettmers et al., 2023) is the engineering achievement that put 65B-parameter fine-tuning onto a single 48GB GPU. The pitch: instead of holding the frozen base model in 16-bit precision, hold it in 4-bit, then attach standard 16-bit LoRA adapters on top. The 4-bit base is read-only; gradients flow through it via dequantization on the fly, but the quantized weights themselves never need to be updated, so the imprecision of 4-bit doesn’t show up where it would hurt. The trainable LoRA adapters stay in 16-bit, where the optimizer needs the dynamic range to make small updates accurately. Three engineering tricks make this work:

NF4 (NormalFloat 4-bit). A custom 4-bit data type designed to be information-theoretically near-optimal for normally distributed weights. Standard 4-bit integer quantization spreads its 16 representable values uniformly across the range, but weight distributions are bell-shaped, so most of the precision is wasted on tails that contain few values. NF4 packs the 16 representable values quantile-spaced over a normal distribution, so each value covers roughly the same number of actual weights. Empirically, NF4 matches BF16 performance on downstream tasks; FP4 (the standard 4-bit float) is about 1% behind.

Double quantization. Standard 4-bit quantization stores the quantization constants (the scale factors that map quantized integers back to floats) in FP32; and at 64 weights per scale factor, those scale factors add up to about 0.5 bits per weight of overhead. Double quantization quantizes the quantization constants themselves down to 8-bit, saving another 0.4 bits per weight at minimal quality cost. The names are confusing but the trick is real: it’s worth roughly 1 GB of memory for a 33B model.

Paged optimizers. The optimizer state for AdamW is 2× the size of the parameters being optimized; for a 7B-parameter LoRA fine-tune, that’s manageable, but when the optimizer state is unpredictable (e.g. during a memory spike when a long sequence comes through), the GPU will OOM. QLoRA uses NVIDIA’s unified memory to page optimizer state to CPU RAM when GPU memory pressure is high. The performance cost is real (CPU paging is slow) but it’s the difference between a workload running and a workload OOMing, which is the right trade-off in practice.

The end result: a 7B model fine-tunes in about 6 hours on a single A100 40GB at a cost of around $12 for an Llama 3 8B equivalent run on a public cloud GPU in 2026. A 70B model fine-tunes on a single A100 80GB or H100. The cost ratio versus full fine-tuning is roughly 50–100×, with most production benchmarks showing QLoRA within 1–2 percentage points of full fine-tuning on AlpacaEval or domain-specific evals.

Code: a full QLoRA fine-tune in TRL + PEFT + bitsandbytes

The pipeline below trains a 7B model on a single 24GB GPU. Setup: pip install "trl>=0.22" "transformers>=4.46" "datasets>=3" "peft>=0.13" "bitsandbytes>=0.43" accelerate.

python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from trl import SFTConfig, SFTTrainer

BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"
OUT_DIR = "./qlora-qwen-7b"

# 4-bit NF4 quantization: the frozen base sits in 4-bit; LoRA adapters stay in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # information-theoretically near-optimal
    bnb_4bit_use_double_quant=True,     # ~0.4 bits/weight saved
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA config: the hyperparameter triangle.
peft_config = LoraConfig(
    r=32,                               # rank: 32 is the general-SFT default
    lora_alpha=64,                      # alpha = 2r is a common stronger default
    target_modules="all-linear",        # 2026 default; picks every linear layer
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=True,                      # DoRA: cheap quality upgrade
    # init_lora_weights="pissa",        # PiSSA: faster convergence; expensive init
    # use_rslora=True,                  # rsLoRA: enable if going to r>64
)

# Wrap the base model with the LoRA adapter.
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# >>> trainable params: 21,143,552 || all params: 7,636,361,856 || trainable%: 0.27

# Public instruction-tuning dataset; small subset for demonstration.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")

config = SFTConfig(
    output_dir=OUT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size 16
    learning_rate=2e-4,                 # LoRA LR is ~10x higher than full FT
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_seq_length=2048,
    bf16=True,
    logging_steps=20,
    save_strategy="epoch",
    optim="paged_adamw_8bit",           # QLoRA's paged optimizer
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=ds,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model(OUT_DIR)             # saves only the ~84 MB adapter, not 14 GB base

The scaffold makes several training choices explicit.

optim="paged_adamw_8bit" is the paged optimizer from QLoRA. It pages optimizer state to CPU RAM under memory pressure; without it, long sequences in your batch can OOM the GPU during the AdamW second-moment update. Use it whenever you’re at the memory ceiling on a single GPU.

Second, the saved checkpoint at OUT_DIR is the adapter only; typically 50–200 MB depending on rank and model size. The base model is not duplicated; you reload it at inference time from the original checkpoint and attach the adapter. This is what makes multi-tenant serving viable: 100 adapters at 100 MB each is 10 GB of disk, versus 100× the base model size if you’d done full fine-tuning.

Third, the learning rate (2e-4) is roughly 10× higher than full fine-tuning’s 2e-5. The LoRA adapter starts from zero contribution, so the optimizer needs aggressive updates to install the behavior; a too-low LR will just stall.

Merge or hot-swap adapters

After training, the adapter and base can either be merged into a single checkpoint or served as two artifacts composed at runtime. The decision is a single-axis trade-off and the right answer depends entirely on whether you’re serving one fine-tuned model or many.

Single-tenant: merge. If exactly one adapter will ever be served against this base model, merge_and_unload() folds the adapter into the base weights and returns a standard transformers checkpoint. Inference latency drops by roughly 5–15% (no per-layer LoRA matmul) and deployment simplifies (no PEFT dependency at inference time). The cost is permanent: once merged, you can’t swap adapters or unmerge cheaply.

python

1
2
3
4
5
6
7
8
9
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype="bfloat16")
peft_model = PeftModel.from_pretrained(base, "./qlora-qwen-7b")

# Merge LoRA into the base; returns a standard transformers model.
merged = peft_model.merge_and_unload()
merged.save_pretrained("./qwen-7b-merged")

Multi-tenant: serve unmerged with hot-swap. If you’re serving N adapters against the same base model; one per customer, one per use case, one per A/B test variant; keep them unmerged. vLLM 0.7+ supports native dynamic LoRA loading: the base model lives permanently in GPU memory, adapters sit in CPU RAM (or NVMe), and the runtime pages an adapter onto the GPU per request; a transfer that takes single-digit milliseconds because the payload is tiny. The --enable-lora and --max-loras flags configure how many adapters can be resident on the GPU at once; PCIe transfer overhead is amortized across the request’s prefill and decode passes.

bash

1
2
3
4
5
6
7
8
9
# Launch vLLM with multi-LoRA serving
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 64 \
  --lora-modules \
    tenant-a=./adapters/tenant-a \
    tenant-b=./adapters/tenant-b \
    tenant-c=./adapters/tenant-c

At inference time, the client requests the adapter by name in the model field of the API call, and vLLM routes the request through the correct LoRA. The cost economics flips here: instead of provisioning one GPU per fine-tuned model, you provision one GPU and host dozens to thousands of tenants on it. The break-even point is around 3–4 adapters; below that, single-adapter merged serving is faster, but above it, multi-LoRA dominates by an order of magnitude on dollar cost per request.

The S-LoRA paper (Sheng et al., 2023) is the foundational work on this serving pattern; they showed that with batched LoRA computation and adapter caching policies, throughput stays close to base-model throughput even with thousands of adapters resident on a single GPU. vLLM’s implementation is the production-grade descendant of that work.