From Pre-Training to RLHF

LLM pretraining, supervised fine-tuning, preference optimization, and verifiable rewards.

Jatin Bansal

An API model is the result of several training stages. Pretraining learns statistical structure from a large corpus. Supervised fine-tuning (SFT) teaches the model to produce assistant responses in desired formats. Preference optimization changes the relative likelihood of plausible responses. Some reasoning models add reinforcement learning against verifiable outcomes. Each stage uses different data and a different objective, so a failure introduced in one stage cannot always be repaired by another.

Definition

The InstructGPT pipeline established a widely used sequence: pretraining, SFT, reward-model training, and reinforcement learning. “Post-training” now covers SFT and the preference or reinforcement stages applied after the base model is trained. Direct preference methods can replace the separate reward model and PPO loop, while verifiable tasks can use deterministic rewards.

Stage 1: Pretraining

Pretraining minimizes next-token cross-entropy over a large, cleaned corpus containing text, code, mathematics, dialogue, and other sources. Every token position supplies a training target. The output is a base model that can continue many kinds of text but has not necessarily learned the turn structure or response policy expected from a chat assistant.
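As a rough illustration of the objective described above, the following sketch computes mean next-token cross-entropy over one sequence using only the standard library. The function name and the list-of-lists logits layout are assumptions made for this example; real training code works on batched tensors.

```python
import math


def next_token_cross_entropy(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: one list of vocabulary scores per position (hypothetical layout).
    token_ids: the token sequence; the target at position t is token_ids[t + 1].
    """
    n_targets = len(token_ids) - 1  # the final position has no next token
    total = 0.0
    for t in range(n_targets):
        scores = logits[t]
        # Stable log-sum-exp: subtract the max before exponentiating.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        target = token_ids[t + 1]
        total += log_z - scores[target]  # equals -log p(target | prefix)
    return total / n_targets


# With uniform scores over a 4-token vocabulary, the loss is log(4).
uniform_loss = next_token_cross_entropy([[0.0] * 4] * 3, [0, 1, 2])
```

Every position except the last contributes a term, which is the sense in which every token supplies a training target.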

The corpus determines much of the model’s factual coverage and linguistic capability. Later training can make existing capability easier to elicit and can teach limited new behavior, but it is an inefficient way to maintain changing factual knowledge. Retrieval is usually better suited to knowledge injection.

Base-model behavior also follows the distribution of its corpus rather than a product policy. Chat templates, refusal patterns, tool-call formats, and other interface conventions are introduced during post-training. This is why evaluating a base checkpoint with an instruction-tuned prompt can produce misleading results.

Stage 2: Supervised Fine-Tuning (SFT)

SFT trains on prompt-and-demonstration examples. The loss is usually masked so only assistant tokens contribute. The model learns the chat template, instruction-response relationship, output formats, and behaviors represented in the demonstrations. Examples may be written by people or distilled from a stronger model.
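The masking step can be sketched as follows. This is a minimal illustration, assuming each token already carries a role tag; the `build_labels` helper and the role strings are hypothetical, and -100 is used because it is the default ignored label in PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(token_ids, roles):
    """Keep assistant tokens as targets and mask every other position."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


tokens = [11, 12, 13, 21, 22, 23]
roles = ["user", "user", "user", "assistant", "assistant", "assistant"]
labels = build_labels(tokens, roles)
# labels == [-100, -100, -100, 21, 22, 23]
```

Printing a few rows of labels next to the decoded tokens is a cheap way to confirm that the mask covers the intended positions before starting a long run.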

SFT imitates demonstrations; it does not directly learn a ranking over several acceptable answers. Its behavior reflects the coverage and consistency of the dataset. Conflicting styles average together, rare cases remain weak, and memorized refusal wording may not generalize to adversarial paraphrases. Preference data addresses ranking more directly, although it introduces its own label-quality problems.

The minimal SFT loop in TRL looks like this:

python

# pip install "trl>=0.22" "transformers>=4.46" "datasets>=3" accelerate peft
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

MODEL = "Qwen/Qwen2.5-0.5B"  # small enough to fine-tune on a single GPU
model = AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# UltraChat is a public SFT dataset of multi-turn assistant dialogues
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")

config = SFTConfig(
    output_dir="./sft-qwen",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,           # typically well below peak pretraining LRs
    logging_steps=20,
    max_length=1024,
    completion_only_loss=True,    # mask out user-turn tokens from the loss
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=ds,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model("./sft-qwen")

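completion_only_loss=True is meant to exclude non-assistant tokens from the training objective. In TRL this option is documented for prompt-completion datasets; conversational datasets with a messages column, such as UltraChat, are covered by the separate assistant_only_loss option, which depends on a chat template that marks assistant spans. Confirm that the selected chat template and dataset expose the assistant-token mask correctly; otherwise a syntactically valid run may optimize the wrong positions. LoRA can reduce the number of trainable parameters and the optimizer-state footprint when full-parameter SFT is unnecessary.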

Stage 3: Preference Optimization

Preference optimization trains on judgments between plausible responses. A common record contains (prompt, chosen, rejected). The objective increases the relative probability or reward of the chosen response while constraining movement away from the SFT model.

Era 1 (2022–2023): PPO-style RLHF

The original InstructGPT pipeline split preference optimization into two sub-stages. First, train a reward model, typically a transformer with a scalar output head, to score a response by minimizing a Bradley-Terry-style loss on preference pairs. Second, treat the language model as a policy and the reward model as the reward signal, then run proximal policy optimization (PPO) to push the policy toward higher-reward outputs. A KL-divergence penalty against the SFT model limits policy drift. This reference-model anchor reduces reward hacking and mode collapse.
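The reward-model objective mentioned above can be written for a single preference pair as the negative log-sigmoid of the reward margin. The sketch below assumes scalar rewards are already available; the function name is illustrative and not taken from any library.

```python
import math


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


tied = bradley_terry_loss(0.0, 0.0)      # log(2): the model cannot tell them apart
correct = bradley_terry_loss(2.0, 0.0)   # small loss: chosen scored higher
inverted = bradley_terry_loss(0.0, 2.0)  # large loss: rejected scored higher
```

Only the difference between the two rewards matters, so the absolute reward scale is not pinned down by this loss.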

PPO-style RLHF requires a policy, reference policy, reward model, value model, rollout generation, and an optimizer. The learned reward is an imperfect proxy, so the policy can exploit reward-model errors. KL regularization, held-out human evaluation, and inspection of high-reward samples are part of the training loop rather than optional checks.
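One common way to apply the KL regularization is to subtract a scaled log-probability ratio from the reward before the policy update. The following is a simplified, sequence-level sketch; the function name and the `kl_coef` default are assumptions for illustration, and production implementations usually apply the penalty per token.

```python
def kl_shaped_reward(reward, policy_logprob, reference_logprob, kl_coef=0.1):
    """Reward minus a penalty for drifting away from the reference model.

    policy_logprob and reference_logprob are log-probabilities of the same
    sampled response under the current policy and the frozen reference.
    """
    log_ratio = policy_logprob - reference_logprob
    return reward - kl_coef * log_ratio


# Same likelihood under both models: no penalty is applied.
unchanged = kl_shaped_reward(1.0, policy_logprob=-5.0, reference_logprob=-5.0)
# The policy now assigns far more probability than the reference did,
# so part of the proxy reward is cancelled.
drifted = kl_shaped_reward(1.0, policy_logprob=-2.0, reference_logprob=-5.0)
```

A response that scores well only because the policy has moved far from the reference earns less shaped reward, which is how the penalty limits exploitation of reward-model errors.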

Era 2 (2024–2025): DPO and direct preference methods

Direct Preference Optimization (Rafailov et al., 2023) expresses the preference objective as a supervised loss under the Bradley-Terry model. It avoids a separately trained reward model, PPO sampling loop, and value model. Training still compares the policy with a frozen reference model for chosen and rejected responses.

DPO is often easier to operate than PPO-style RLHF, but it still depends on a reference model, preference quality, and the strength of the regularization. IPO (Identity Preference Optimization) modifies the objective to limit overfitting to separable preferences. KTO (Kahneman-Tversky Optimization) can use unpaired desirable and undesirable examples, which fits some production-feedback datasets better than paired comparisons.
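For intuition, the DPO loss for one preference pair can be sketched from four sequence log-probabilities. This is a minimal standard-library version under the assumption that those log-probabilities have already been summed over response tokens; the function and argument names are illustrative.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_log_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_log_ratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    # Stable evaluation of log(1 + exp(-logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: both log-ratios are zero, loss is log(2).
at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the chosen response and lowered the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The reference log-probabilities enter only through the two log-ratios, which is why the frozen reference model is still needed even though no reward model is trained.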

The minimal DPO loop in TRL:

python

# pip install "trl>=0.22"
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

MODEL = "./sft-qwen"  # the output of the SFT stage above
model = AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# UltraFeedback is a public preference dataset of (prompt, chosen, rejected) triples
ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:5000]")

config = DPOConfig(
    output_dir="./dpo-qwen",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=5e-7,           # DPO uses lower LRs than SFT
    beta=0.1,                     # KL strength; 0.1 is the canonical default
    logging_steps=20,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=ds,
    processing_class=tokenizer,
    # ref_model is auto-loaded from MODEL if not specified
)
trainer.train()
trainer.save_model("./dpo-qwen")

beta controls the preference-versus-reference trade-off. Under TRL’s DPO convention, a higher beta keeps the policy closer to the reference model and a lower beta allows more deviation. Treat 0.1 as an experiment starting point, then tune against held-out preference accuracy and general-capability evaluations. Other formulations and libraries may scale or name the regularization term differently, so verify the implementation rather than relying on a remembered rule.

Era 3 (2025–2026): RLAIF and RL with verifiable rewards

Reinforcement Learning from AI Feedback (RLAIF) uses model-generated preference labels, sometimes against a written constitution. It can expand a dataset cheaply, but the judge’s systematic errors become training signal. Human calibration sets and disagreement sampling help detect this transfer.

Reinforcement Learning with Verifiable Rewards (RLVR) uses outcomes such as unit tests, exact mathematical answers, or schema validation. DeepSeek-R1 used GRPO, which estimates advantage from groups of sampled responses without a separate critic. Verifiable rewards are strongest where the checker measures the full task; a weak checker invites reward hacking just as a learned reward model does.
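The group-based advantage estimate can be illustrated with a toy exact-match checker. This sketch assumes the population standard deviation and a small epsilon for normalization, which may differ from any particular implementation; both function names are hypothetical.

```python
from statistics import mean, pstdev


def exact_match_reward(response, expected):
    """Deterministic reward: 1.0 if the answer matches exactly, else 0.0."""
    return 1.0 if response.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (no critic)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same prompt, checked against the known answer.
samples = ["42", "41", " 42 ", "forty-two"]
rewards = [exact_match_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal. The checker here also accepts only one surface form, which shows how checker design shapes what the policy is rewarded for.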

These methods can be composed, but the order is a design choice. Preserve a checkpoint and evaluation suite at each boundary. A later stage can improve its target metric while reducing instruction following, calibration, safety behavior, or domain performance learned earlier.

Data and evaluation boundaries

Each stage needs a separate data contract. Pretraining examples are token sequences with document-level provenance and filtering metadata. SFT examples need a chat template, an explicit assistant mask, and a policy for multi-turn conversations. Preference records need the prompt, both responses, the judgment, the label source, and enough metadata to detect repeated annotators or model-generated duplicates. Verifiable-reward records also need a checker version because changing a unit test changes the learning objective.
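A data contract of this kind can be made explicit in code. The sketch below is one possible shape for preference and verifiable-reward records; every field name and validation rule is an assumption for illustration, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreferenceRecord:
    prompt: str
    chosen: str
    rejected: str
    label_source: str                   # e.g. "human" or "model_judge" (illustrative)
    annotator_id: Optional[str] = None  # helps detect repeated annotators

    def __post_init__(self):
        if not self.prompt.strip():
            raise ValueError("prompt must be non-empty")
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected must differ")
        if not self.label_source:
            raise ValueError("label_source is required")


@dataclass(frozen=True)
class VerifiableRecord:
    prompt: str
    expected: str
    checker_version: str  # changing the checker changes the objective

    def __post_init__(self):
        if not self.checker_version:
            raise ValueError("checker_version is required")


record = PreferenceRecord(
    prompt="Summarize the report.",
    chosen="A short, accurate summary.",
    rejected="An unrelated answer.",
    label_source="human",
    annotator_id="annotator-07",
)
```

Rejecting malformed records at load time keeps label-quality problems from surfacing later as unexplained training behavior.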

Split data by source or task family before generating near-duplicates. A random row split can put paraphrases, adjacent document chunks, or several sampled responses to the same prompt on both sides of the boundary. The resulting validation loss looks better without measuring generalization. Deduplicate across train and evaluation sets, and keep a small contamination audit for public benchmarks.
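One way to keep related rows on the same side of the boundary is to assign the split from a hash of a group key rather than from the row index. This sketch assumes each row carries a `prompt_id` field identifying its source prompt; that field name and the helper are hypothetical.

```python
import hashlib


def assign_split(group_key, eval_fraction=0.1):
    """Deterministically map a group key to 'train' or 'eval'."""
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


rows = [
    {"prompt_id": "p1", "response": "sample A"},
    {"prompt_id": "p1", "response": "sample B"},  # same prompt, second sample
    {"prompt_id": "p2", "response": "sample C"},
]

splits = {"train": [], "eval": []}
for row in rows:
    splits[assign_split(row["prompt_id"])].append(row)
```

Because the assignment depends only on the key, every response sampled from one prompt lands in the same split, and the split stays stable when new rows are added later.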

Evaluate the checkpoint produced by every stage. A base model needs language-model loss and capability benchmarks. SFT adds instruction following, format validity, and turn-taking. Preference optimization adds pairwise preference win rate, but that metric must be accompanied by general-capability, calibration, safety, and style checks. RLVR needs held-out problems and adversarial tests of the verifier. The training reward is unsuitable as the only evaluation because the policy was optimized to increase it.

Compare against the previous checkpoint with paired prompts and fixed decoding settings. Where sampling is part of the product, repeat generations under several seeds and report uncertainty. Keep examples of regressions, not only aggregate scores; a small average gain can conceal a severe loss in one domain.
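A paired comparison with uncertainty can be sketched as a bootstrap over per-prompt score differences. The function below is a minimal illustration using a percentile interval; its name, defaults, and the example scores are assumptions for this sketch.

```python
import random


def paired_bootstrap_ci(old_scores, new_scores, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Mean per-prompt difference (new - old) with a percentile interval."""
    if len(old_scores) != len(new_scores):
        raise ValueError("scores must be paired by prompt")
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical per-prompt scores for two checkpoints on the same prompts.
old = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
new = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0]
mean_diff, lower, upper = paired_bootstrap_ci(old, new)
```

An interval that includes zero suggests the aggregate gain may be noise, and the individual negative differences are the regressions worth keeping as examples.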

Common post-training failures

Reward hacking occurs when the model discovers an output that scores well without satisfying the intended task. It can exploit a learned judge, a parser, or an incomplete unit test. Inspect the highest-reward samples and strengthen the evaluator before increasing optimization pressure.

Mode narrowing appears when preference training repeatedly rewards one style or response pattern. Diversity, calibration, and performance on neutral prompts may decline even while preference accuracy rises. A reference-model constraint limits movement but does not replace broad evaluation.

Catastrophic forgetting is most likely when the new dataset is narrow or the learning rate is too high. Mix replay examples from general capabilities, use conservative updates, and stop on held-out regressions. If only a small behavior change is required, parameter-efficient fine-tuning can simplify rollback and comparison, although the adapter can still degrade earlier behavior while it is active.

Label bias is a data problem rather than an optimizer problem. Annotator instructions, judge prompts, sampling temperature, and response ordering can all alter preferences. Measure agreement, randomize presentation order, retain disagreement, and avoid converting uncertain judgments into confident binary labels.
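Two of these practices, randomized presentation order and retained disagreement, can be sketched in a few lines. The helper names and vote strings below are hypothetical and chosen only for this example.

```python
import random


def present_pair(response_a, response_b, rng):
    """Randomly choose which response is shown first; return (shown, swapped)."""
    swapped = rng.random() < 0.5
    shown = (response_b, response_a) if swapped else (response_a, response_b)
    return shown, swapped


def to_original_order(vote, swapped):
    """Map a vote on displayed positions ('first', 'second', 'tie') to 'a', 'b', or 'tie'."""
    if vote == "tie":
        return "tie"
    picked_first = vote == "first"
    # When the pair was swapped, the first displayed response is b.
    if swapped:
        return "b" if picked_first else "a"
    return "a" if picked_first else "b"


def soft_label(votes):
    """Fraction of non-tie votes preferring 'a'; None if every vote is a tie."""
    decided = [v for v in votes if v != "tie"]
    if not decided:
        return None
    return sum(v == "a" for v in decided) / len(decided)


rng = random.Random(0)
shown, swapped = present_pair("response A", "response B", rng)
label = soft_label(["a", "a", "b", "tie"])  # two of three decided votes prefer a
```

Storing the fraction instead of a forced binary label lets downstream training weight or filter uncertain pairs.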