$ cat ai-engineering/dpo-and-alignment.md

DPO and Modern Alignment

Direct Preference Optimization, its common variants, production workflow, and failure modes.

Jatin Bansal@blog:~/ai-engineering$ open dpo-and-alignment

Direct Preference Optimization trains on (prompt, chosen, rejected) triples without a separate reward model or online PPO loop. Operationally it resembles supervised fine-tuning with a custom loss and a frozen reference model, making preference optimization much easier to run and debug than a multi-component RLHF pipeline.

The DPO training contract

Direct Preference Optimization (DPO) is a supervised-learning algorithm that fits a language model to a preference dataset by directly optimizing a closed-form expression of the optimal RLHF policy, without training a separate reward model and without running a reinforcement-learning loop. The input is the same (prompt, chosen, rejected) triples that an RLHF reward model would train on; the output is a model whose likelihoods on chosen responses go up relative to rejected ones, regularized by KL divergence against a frozen reference model. Mechanically, DPO is one custom loss function plugged into a standard Trainer. The only oddity is that each batch requires two forward passes, one through the trainable policy and one through the frozen reference, so that the chosen/rejected log-probability ratios can be compared.

The reason DPO matters operationally is the absence of the reward model. The original InstructGPT-style RLHF pipeline (Ouyang et al., 2022) required training a reward model on the preferences, then running PPO with that reward model in the loop. That puts four models in play at training time (policy, reference, reward model, and value model), each a separate forward pass, each a tuning surface, each a place for the pipeline to silently misbehave. The reward model in particular is the brittle joint: once the policy learns to exploit a gap in the reward model’s coverage, the entire pipeline starts producing reward-hacked outputs that score well on the reward model and poorly with humans. DPO sidesteps this by construction. The reward model is implicit (it falls out of the closed-form derivation), so there’s no separate component to hack.

Understand the DPO loss

Two pieces are worth pulling apart because they govern the practical behavior of the trainer.

The implicit reward margin. Define r̂_θ(x, y) = β log(π_θ(y|x) / π_ref(y|x)). This is the “implicit reward” the trained model assigns to a response, expressed as a log-likelihood ratio scaled by β. The DPO loss is -log σ(r̂_θ(x, y_w) - r̂_θ(x, y_l)), so minimizing it pushes up the implicit reward gap between chosen and rejected. β controls the conversion factor between log-likelihood ratios and reward. High β (1.0+) means a small log-likelihood gap counts as a big reward gap; the loss is satisfied with small moves away from the reference and the policy stays close to the SFT model. Low β (0.01–0.05) means the log-likelihood gap has to be large to register as reward; the policy drifts further from the reference and learns more aggressively. The canonical default is 0.1, and most production runs sit between 0.05 and 0.5; Llama 3’s post-training used β = 0.1.
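To make the role of β concrete, here is a minimal pure-Python sketch of the loss for a single preference pair. The log-probability values are made up for illustration, and `implicit_reward` and `dpo_loss` are hypothetical helper names, not part of TRL.

```python
import math

def implicit_reward(beta, policy_logp, ref_logp):
    """r_hat = beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)

def dpo_loss(beta, pol_w, pol_l, ref_w, ref_l):
    """-log sigmoid(r_hat(chosen) - r_hat(rejected)) for one pair."""
    margin = implicit_reward(beta, pol_w, ref_w) - implicit_reward(beta, pol_l, ref_l)
    # -log sigmoid(m) == log(1 + exp(-m)); fine for margins of ordinary size
    return math.log1p(math.exp(-margin))

# Hypothetical sequence log-probs: relative to the reference, the policy has
# moved 2 nats toward the chosen response and 2 nats away from the rejected one.
pol_w, ref_w = -48.0, -50.0
pol_l, ref_l = -62.0, -60.0

for beta in (0.01, 0.1, 1.0):
    print(f"beta={beta}: loss={dpo_loss(beta, pol_w, pol_l, ref_w, ref_l):.4f}")
```

The same 4-nat log-ratio gap leaves the loss close to log 2 at β = 0.01 and drives it near zero at β = 1.0, which is why a high β stops pushing the policy early and a low β keeps pushing.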

The two forward passes. Each training step requires four log-probabilities: log π_θ(y_w|x), log π_θ(y_l|x), log π_ref(y_w|x), log π_ref(y_l|x). The first two come from the trainable model; the last two come from the frozen reference. In TRL, when no ref_model is passed, the reference is created automatically as a copy of the policy at the start of training, kept on the same device, and never updated. This roughly doubles the weight memory of the training run versus SFT: both models need to fit on the device, plus optimizer state for one of them. For large models, PEFT/LoRA sidesteps this: only the adapters are trained, and the base model with adapters disabled serves as the reference, so no second copy is loaded. Quantizing the base model (for example to 4-bit) reduces memory further, dropping the cost back to roughly SFT levels.
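Each of those four numbers is a sum of per-token log-probabilities over the response tokens only. The sketch below shows that reduction in plain Python; the token values and the shared mask are invented for illustration, and a real trainer does the same thing on batched tensors with separate masks for the chosen and rejected sequences.

```python
def sequence_logp(token_logps, completion_mask):
    """Sum per-token log-probs over response tokens only (mask value 1)."""
    return sum(lp for lp, keep in zip(token_logps, completion_mask) if keep)

# Hypothetical per-token log-probs; the first two positions are prompt tokens
# and are masked out so they do not contribute to the sequence score.
mask = [0, 0, 1, 1, 1]
token_logps = {
    "policy_chosen":   [-0.2, -0.1, -1.0, -0.5, -0.5],
    "policy_rejected": [-0.2, -0.1, -2.0, -1.5, -1.0],
    "ref_chosen":      [-0.2, -0.1, -1.5, -0.5, -0.5],
    "ref_rejected":    [-0.2, -0.1, -1.5, -1.5, -1.0],
}
logps = {name: sequence_logp(lps, mask) for name, lps in token_logps.items()}

# The two log-ratios that feed the implicit reward margin.
chosen_log_ratio = logps["policy_chosen"] - logps["ref_chosen"]
rejected_log_ratio = logps["policy_rejected"] - logps["ref_rejected"]
print(logps)
print(chosen_log_ratio, rejected_log_ratio)
```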

A subtler point about the loss: DPO does not push up the absolute likelihood of the chosen response. It pushes up the relative likelihood of chosen versus rejected. In practice, DPO often decreases the absolute likelihood of both responses: chosen drops a little, rejected drops a lot, the margin goes up, and the model “wins” on win-rate metrics. This is the chosen-probability-drop phenomenon that DPO papers have been documenting since 2024, and it’s the single most surprising fact about DPO training to engineers who came from supervised learning. The model is getting better at the comparative objective; whether it’s also getting better at the generative objective is a separate question.
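A small worked example shows how the loss can fall while the chosen response becomes less likely. All numbers here are hypothetical and picked only to make the arithmetic easy to follow.

```python
import math

BETA = 0.1

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=BETA):
    """Sigmoid DPO loss for one pair, from sequence log-probs."""
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return math.log1p(math.exp(-margin))

ref_w, ref_l = -50.0, -60.0                   # hypothetical reference log-probs
before = {"pol_w": -50.0, "pol_l": -60.0}     # policy equals reference at step 0
after = {"pol_w": -55.0, "pol_l": -85.0}      # chosen fell 5 nats, rejected fell 25

loss_before = dpo_loss(ref_w=ref_w, ref_l=ref_l, **before)
loss_after = dpo_loss(ref_w=ref_w, ref_l=ref_l, **after)

print(f"chosen logp: {before['pol_w']} -> {after['pol_w']}")
print(f"loss:        {loss_before:.4f} -> {loss_after:.4f}")
```

The chosen log-probability went down, yet the loss improved because the rejected response fell much further. This is the reason to track the chosen log-probability alongside the loss rather than trusting the loss curve alone.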

Choose a loss variant deliberately

DPO spawned a family of follow-ups, each fixing one of DPO’s documented limitations. The four worth knowing in 2026:

IPO (Identity Preference Optimization, Azar et al., 2023). Fixes the overfitting failure mode where DPO’s sigmoid loss can drive the model to extreme certainty on a few high-confidence preferences while ignoring the rest. The fix is to replace the sigmoid with a squared loss that regresses the raw log-ratio margin toward a fixed, finite target. Because the optimum sits at a bounded margin, the model can’t keep lowering the loss by trading off everything else for hyper-certain agreement on a handful of pairs. In TRL: set loss_type="ipo" on DPOConfig. Use it when your preference data has a long tail of subtle preferences that DPO would otherwise drown out by over-fitting on the loud pairs.
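The difference is easiest to see by evaluating both losses on the same log-ratio gap. This is a sketch under the assumption that the IPO objective takes the form (h - 1/(2β))², where h is the chosen-minus-rejected log-ratio gap; the function names are hypothetical and the gap values are illustrative.

```python
import math

def sigmoid_dpo_loss(h, beta):
    """Vanilla DPO: keeps shrinking as the gap h grows without limit."""
    return math.log1p(math.exp(-beta * h))

def ipo_loss(h, beta):
    """IPO-style squared loss: distance between h and the target 1/(2*beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2

beta = 0.1  # target gap is 1 / (2 * 0.1) = 5 nats
for h in (0.0, 5.0, 20.0, 50.0):
    print(f"h={h:>5}: dpo={sigmoid_dpo_loss(h, beta):.4f}  ipo={ipo_loss(h, beta):.1f}")
```

The sigmoid loss rewards every extra nat of separation, while the squared loss is lowest at the target gap and penalizes overshooting it.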

KTO (Kahneman-Tversky Optimization, Ethayarajh et al., 2024). Drops the paired-comparison data shape entirely. Instead of (prompt, chosen, rejected) triples, KTO trains on (prompt, response, label) rows where label ∈ {desirable, undesirable}, which is exactly the shape of production thumbs-up/down feedback. The loss is grounded in prospect theory’s value function (the asymmetry between gain and loss that Kahneman & Tversky identified) and applies different loss weights to desirable and undesirable signals, mirroring loss aversion. The practical implication is significant: instead of asking annotators to compare pairs (slow, expensive, hard to get right when both responses are mediocre), you can train directly on the thumbs your users are already leaving. In TRL: KTOTrainer. Use it when your data is unpaired binary feedback and you have substantially more “good” than “bad” rows (or vice versa); KTO handles imbalanced classes naturally where DPO needs careful pairing.
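As a rough illustration of the data shape, the sketch below maps thumbs events to unpaired rows. The event schema (`prompt`, `response`, `thumb`) is hypothetical, and the output column names (`prompt`, `completion`, boolean `label`) follow the unpaired-preference layout that KTOTrainer is documented to accept, so check them against the TRL version you have installed.

```python
from collections import Counter

def feedback_to_kto_rows(events):
    """Map thumbs events to unpaired rows: prompt, completion, boolean label."""
    rows = []
    for event in events:
        if event["thumb"] not in ("up", "down"):
            continue  # skip events that carry no explicit signal
        rows.append({
            "prompt": event["prompt"],
            "completion": event["response"],
            "label": event["thumb"] == "up",  # True means desirable
        })
    return rows

# Hypothetical production feedback log.
events = [
    {"prompt": "Summarize the ticket.", "response": "Customer cannot log in.", "thumb": "up"},
    {"prompt": "Summarize the ticket.", "response": "I cannot help with that.", "thumb": "down"},
    {"prompt": "Draft a reply.", "response": "Thanks for reaching out!", "thumb": "up"},
    {"prompt": "Draft a reply.", "response": "OK.", "thumb": None},
]

rows = feedback_to_kto_rows(events)
counts = Counter(row["label"] for row in rows)
print(len(rows), dict(counts))
```

Counting the two classes up front is worth doing, since the desirable/undesirable balance is what the KTO loss weights are meant to compensate for.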

ORPO (Odds Ratio Preference Optimization, Hong et al., 2024). Fuses the SFT and DPO stages into a single objective that does both jobs in one pass. The loss is L_ORPO = L_SFT(y_w | x) + λ · L_OR(y_w, y_l | x), where the SFT term is standard cross-entropy on the chosen response and the OR term is an odds-ratio loss on the preference. No reference model is needed: ORPO is reference-free because the SFT term anchors the model to the chosen response directly, and the OR term sharpens the margin against the rejected response. Memory cost is roughly half of DPO (no reference model on the device), training time is roughly half (no second forward pass), and the original paper showed ORPO-fine-tuned Phi-2/Llama-2/Mistral outperforming SFT+DPO at smaller scales on AlpacaEval and MT-Bench. Use it when you don’t already have an SFT checkpoint and want to do SFT + preference optimization in one pass, particularly when memory is tight. The trade-off: ORPO needs more diverse preference data than DPO to converge well, because it can’t lean on a pre-aligned SFT base.
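The two terms can be sketched for a single pair as follows. This assumes the likelihood inside the odds is the length-normalized one, p = exp(average token log-prob), and uses λ = 0.1 purely as an illustrative weight; the function names are hypothetical.

```python
import math

def log_odds(avg_logp):
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    p = math.exp(avg_logp)
    return avg_logp - math.log1p(-p)

def orpo_loss(avg_logp_w, avg_logp_l, lam=0.1):
    """L_SFT on chosen plus lam times the odds-ratio preference term."""
    sft_term = -avg_logp_w                          # per-token NLL of the chosen response
    log_odds_ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    or_term = math.log1p(math.exp(-log_odds_ratio))  # -log sigmoid(log odds ratio)
    return sft_term + lam * or_term

# Hypothetical average token log-probs for chosen and rejected responses.
print(f"{orpo_loss(-0.5, -1.5):.4f}")   # chosen already more likely
print(f"{orpo_loss(-1.5, -0.5):.4f}")   # rejected more likely: both terms are larger
```

No reference log-probabilities appear anywhere in the computation, which is where the memory and time savings come from.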

SimPO (Simple Preference Optimization, Meng et al., 2024). The newer reference-free alternative. SimPO replaces the log-likelihood ratio against the reference with the average (length-normalized) log-likelihood of the response itself, and adds a target margin term to widen the gap. The loss is:

```text
L_SimPO = -log σ(β · (avg_log_π(y_w|x) - avg_log_π(y_l|x)) - γ)
```

where γ is the target margin. Two things to notice. First, the average log-likelihood (the summed log-likelihood divided by token count) directly attacks DPO’s length-bias problem, because longer responses can’t accumulate higher rewards just by being longer. Second, the absence of π_ref means SimPO’s training cost is roughly half of DPO’s (one forward pass per batch). SimPO’s NeurIPS 2024 paper reported a 53.7 length-controlled win rate on AlpacaEval 2 for a model built on Llama-3-8B-Instruct; the length-controlled metric explicitly corrects for the verbosity DPO is biased toward. Use SimPO when length bias is the failure mode you’re seeing, or when memory pressure makes the reference model expensive.
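A minimal sketch of the SimPO loss for one pair makes the length normalization explicit. The β and γ values below are illustrative assumptions rather than recommended settings, and `simpo_loss` is a hypothetical helper name.

```python
import math

def simpo_loss(sum_logp_w, len_w, sum_logp_l, len_l, beta=2.0, gamma=1.0):
    """-log sigmoid(beta * (avg_logp_w - avg_logp_l) - gamma), no reference model."""
    avg_w = sum_logp_w / len_w   # length-normalized log-likelihood of chosen
    avg_l = sum_logp_l / len_l   # length-normalized log-likelihood of rejected
    return math.log1p(math.exp(-(beta * (avg_w - avg_l) - gamma)))

# Hypothetical pair: a long chosen response and a short rejected one.
# On summed log-probs the chosen response looks worse (-80 vs -30);
# per token it is clearly better (-0.8 vs -1.5).
print(f"{simpo_loss(-80.0, 100, -30.0, 20):.4f}")

# Doubling both length and summed log-prob leaves the per-token average,
# and therefore the loss, unchanged.
print(f"{simpo_loss(-160.0, 200, -60.0, 40):.4f}")
```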

The main variants differ as follows:

| Method | Reference model? | Data shape | Primary fix vs DPO | TRL trainer |
| --- | --- | --- | --- | --- |
| DPO | Required | Paired `(p, w, l)` | n/a (baseline) | `DPOTrainer` |
| IPO | Required | Paired | Finite target margin, prevents overconfidence | `DPOTrainer` with `loss_type="ipo"` in `DPOConfig` |
| KTO | Required | Unpaired `(p, y, label)` | Removes pairing requirement | `KTOTrainer` |
| ORPO | Not needed | Paired | Fuses SFT+DPO, reference-free | `ORPOTrainer` |
| SimPO | Not needed | Paired | Length-normalized, reference-free | `CPOTrainer` with `loss_type="simpo"` in `CPOConfig` |

The honest 2026 pattern: start with DPO at β=0.1, watch the length and chosen-probability metrics, and switch to SimPO if length bias dominates or to KTO if your data is unpaired. The other variants are situational.

Code: full DPO pipeline in TRL

The pipeline below assumes you already have an SFT checkpoint (per the training-stage overview). The setup is pip install "trl>=0.22" "transformers>=4.46" "datasets>=3" peft accelerate bitsandbytes.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

SFT_MODEL = "./sft-qwen"   # output of the SFT stage
OUT_DIR = "./dpo-qwen"

tokenizer = AutoTokenizer.from_pretrained(SFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    SFT_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# UltraFeedback is the canonical public DPO dataset: ~60k (prompt, chosen, rejected)
# triples derived from GPT-4 ratings of multi-model responses to a diverse prompt set.
ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# LoRA cuts memory sharply: only the adapters are trained, and the base weights
# (adapters disabled) double as the reference, so no second model copy is loaded.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

config = DPOConfig(
    output_dir=OUT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size 16 per device
    learning_rate=5e-7,               # ~40x lower than a typical 2e-5 SFT LR
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                         # KL strength; canonical default
    loss_type="sigmoid",              # vanilla DPO; "ipo" is a drop-in alternative
    max_length=2048,
    max_prompt_length=1024,
    logging_steps=20,
    save_strategy="epoch",
    bf16=True,
    # chosen-probability-drop mitigation if you stay on vanilla DPO:
    # rpo_alpha=1.0,                  # adds SFT-style NLL on the chosen response
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=ds,
    processing_class=tokenizer,
    peft_config=peft_config,
    # ref_model is auto-loaded from `model` if not specified; with LoRA, the base
    # model with adapters disabled serves as the reference, halving the memory cost.
)
trainer.train()
trainer.save_model(OUT_DIR)
```

Operational notes for this scaffold:

First, the rpo_alpha parameter (commented out above) implements the mitigation from Pang et al.’s iterative reasoning preference optimization (RPO) for the chosen-probability-drop problem. It adds a small SFT-style cross-entropy term on the chosen response to the DPO loss, keeping log π(y_w|x) from collapsing. Set rpo_alpha=1.0 if you’re seeing the chosen-probability metric in TRL’s training logs dropping below baseline; it’s a cheap fix.

Second, the learning rate (5e-7) is much lower than SFT (2e-5). DPO is a refinement pass; aggressive LRs blow out the alignment. If your loss curve is noisy or diverges, drop LR before changing anything else.

Third, the processing_class parameter replaced tokenizer in TRL’s trainer API; the rename landed in the 0.12 release line, alongside transformers 4.46. This is the kind of API-drift breakage that bites pipelines pinned to older docs; if you’re following an older tutorial and hit a TypeError, this is usually why.

Iterate from observed model failures

A single DPO pass on a static preference dataset is the toy version. The production pattern, used by Llama 3’s post-training and most open-weight model releases since, is iterative DPO: alternate between sampling responses from the current model, scoring them with a reward model or LLM judge, and running a DPO pass on the resulting fresh preference pairs. Each iteration uses the previous iteration’s checkpoint as the reference model. The reason this matters: the preference data in a static dataset was generated against responses from some other model, not the model you’re training. As your model improves, its response distribution drifts away from the responses in the dataset, and the dataset becomes off-policy. Iterative DPO keeps the data on-policy.

The Llama 3 post-training pipeline used six iterations of this loop. The Tülu 3 release from AI2 used a similar shape with public datasets, making it the cleanest open reproduction of the pattern to study. The cost is roughly N× a single DPO run (where N is the number of iterations), plus the inference cost of sampling on-policy responses at each step, plus the labeling cost (whether human or LLM-judge). The benefit is that the final model is markedly better than what a single static pass would have produced; the consistent finding across published iterative-DPO runs is a 5–15% lift on AlpacaEval / Arena-Hard between iteration 1 and iteration 4-6.

The architecture sketch for iterative DPO is something like:

```text
loop for N iterations:
    1. Sample K responses per prompt from the current policy
    2. Score them (reward model OR LLM judge OR rule-based verifier)
    3. Form (prompt, best, worst) triples, typically top vs bottom of the K samples
    4. Run a DPO pass with the current policy as both the trainable model and the reference
    5. Replace the current policy with the new checkpoint; goto 1
```
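Steps 2 and 3 of that loop can be sketched in a few lines of plain Python. The scorer here is a stand-in (a real pipeline would call a reward model, LLM judge, or verifier), and `build_preference_pairs` and `min_gap` are hypothetical names introduced for illustration.

```python
def build_preference_pairs(samples, score_fn, min_gap=0.0):
    """Turn K sampled responses per prompt into (prompt, chosen, rejected) rows.

    samples: dict mapping each prompt to a list of sampled responses.
    score_fn: callable (prompt, response) -> float; higher is better.
    """
    pairs = []
    for prompt, responses in samples.items():
        if len(responses) < 2:
            continue  # need at least two samples to form a pair
        # Score each response once, then sort ascending by score.
        scored = sorted(((score_fn(prompt, r), r) for r in responses), key=lambda t: t[0])
        (low, worst), (high, best) = scored[0], scored[-1]
        if high - low <= min_gap:
            continue  # all samples score alike: no usable preference signal
        pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs

def toy_score(prompt, response):
    """Stand-in verifier: 1.0 if the response contains the digit 4, else 0.0."""
    return float("4" in response)

samples = {
    "What is 2 + 2?": ["5", "4", "It is 4.", "22"],
    "What is 3 + 3?": ["6", "Six."],  # both score 0.0 here, so the prompt is skipped
}
pairs = build_preference_pairs(samples, toy_score)
print(pairs)
```

Skipping prompts where every sample scores the same keeps noise pairs out of the next DPO pass; the resulting rows use the same prompt/chosen/rejected keys as the static dataset in the pipeline above.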

This is also the place where the feedback queues you built in the Evaluation subtree start paying off. Production thumbs and edits are on-policy preference data by construction: they were generated against the model you’re actually shipping. A production team running iterative DPO can use their accumulated user feedback as one of the data streams feeding step 2, alongside synthetic on-policy preferences scored by an LLM judge. The data flywheel and the post-training pipeline are the same loop seen from different ends.
