From Pre-Training to RLHF

The three-stage LLM training pipeline — pretraining, SFT, preference optimization — what each step changes, why each exists, and the 2026 reality.

Jatin Bansal@blog:~/ai-engineering$ open pretraining-to-rlhf

Every model you call via API is the output of a pipeline, not a single training run. The base model that emerges from pretraining is a fluent autocomplete engine with no instinct for following instructions, no calibrated refusals, and no idea what a “helpful response” looks like. Turn that base model into Claude, GPT, or Gemini and you’ve layered two or three more training stages on top — supervised fine-tuning to install the format, preference optimization to install the taste. Each stage is solving a different problem; skipping any of them produces a recognizable failure mode. Understanding the pipeline is what lets you reason about why a model behaves the way it does, why fine-tuning your own model breaks the alignment you didn’t realize you were depending on, and why the 2026 production stack looks different from the InstructGPT diagram everyone copied from the 2022 paper.

Opening bridge

The last article in the Production & Operations subtree closed the application-side curriculum: how to ship, observe, and bound the behavior of an agent built on a model someone else trained. Today’s piece opens a new subtree — Training-Side Fundamentals — and walks one layer below the API. The reader who’s just spent fifty-five articles wiring retrieval, memory, agents, and eval infrastructure on top of frontier models now needs to know what’s inside the model they’ve been treating as a black box: why fine-tuning works on style but silently fails for knowledge injection, why RLHF and constitutional training reduce — but don’t eliminate — the need for runtime guardrails, and why the feedback signal you collect in production can flow back into the same pipeline that produced the model in the first place. The training-side subtree is short and optional, but it’s the substrate everything else sits on.

Definition

An LLM training pipeline is a sequence of optimization stages that progressively reshape a parameter set from a randomly initialized neural network into a deployable conversational model. The canonical three-stage shape — the one introduced by the InstructGPT paper (Ouyang et al., 2022) and copied across the industry — is pretraining → supervised fine-tuning (SFT) → reinforcement learning from human feedback (RLHF), with each stage modifying the weights via a different loss function, optimizer regime, and data distribution.

The three stages exist because no single objective can do all the work. Pretraining gives you a model that knows the world but not your intent. SFT gives you a model that knows the format (turn-taking, instruction following, refusal phrasing) but not your preferences (when to be terse vs verbose, when to ask a clarifying question, when to push back). Preference optimization — whether PPO-RLHF, DPO, or RLAIF — installs the taste that distinguishes a “competent” answer from a “good” one. The naming convention has drifted: “post-training” or “alignment” is the umbrella term for everything that happens after pretraining; SFT and preference optimization are the two main steps inside that umbrella.

The distributed-systems parallel

Two parallels do honest work here.

The pipeline is a multi-stage build, not a single compile. A real production binary isn’t the output of one gcc invocation — it’s preprocess → compile → assemble → link → strip → sign, each stage with its own input format, its own tool, its own failure modes, and its own observability surface. Skip the link stage and you have object files that contain the right code but can’t run. Skip the sign stage and the binary is rejected at install time. The LLM pipeline is the same shape: pretraining is the compile (the longest, most expensive step that produces the bulk of the artifact), SFT is the link (resolves the abstract symbols of “what does an instruction look like” against the concrete grammar of turns), and preference optimization is the sign (the final step that turns the artifact into something a user can safely run). The cost asymmetry is also the same — the compile costs 100× the sign, but the sign is the step that decides whether the binary ships.

The pipeline is a curriculum, not a single SGD run. The “curriculum” framing in machine learning — the idea that learning easy examples first and hard examples later improves final performance — applies one level up to the stages themselves. Pretraining is the easy curriculum: predict the next token, which works on any text. SFT is the medium curriculum: predict the assistant’s next token given a specific instruction format. Preference optimization is the hard curriculum: prefer the response that humans would pick given two candidates. Each stage builds on the substrate the previous stage installed. Run the curriculum out of order — try to do RLHF on a randomly initialized model — and the gradient signal is too sparse to converge. Run the curriculum on the right substrate and each stage moves the model a relatively small distance, because the previous stage already got you most of the way there.

Stage 1: Pretraining

The objective is next-token prediction on a massive corpus. You scrape, clean, dedupe, and shuffle trillions of tokens of text — web pages, books, code, dialogue, math, scientific papers — and run them through a transformer in a self-supervised loop: for every position in every sequence, the loss is the cross-entropy of the model’s predicted distribution against the actual next token. The model learns nothing it’s told to learn; it learns whatever statistical regularities exist in the corpus, on the assumption that “good completion of arbitrary text” generalizes to “good performance on tasks you care about.”
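To make the objective concrete, here is a minimal pure-Python sketch of the per-position cross-entropy described above. The function names and the three-token toy vocabulary are illustrative assumptions; a real pretraining run computes the same quantity on GPU tensors over billions of positions.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of each position's prediction against the actual next token.

    logits_per_position[t] holds the scores over the vocabulary after the model
    has read token_ids[: t + 1]; the target for position t is token_ids[t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        losses.append(-math.log(probs[token_ids[t + 1]]))
    return sum(losses) / len(losses)

# Toy example (hypothetical numbers): vocabulary of 3 tokens, sequence [0, 2, 1].
toy_logits = [
    [0.0, 0.0, 0.0],  # after token 0: uniform guess, the target is token 2
    [0.0, 5.0, 0.0],  # after token 2: confident in token 1, which is correct
    [0.0, 0.0, 0.0],  # the last position has no next token, so it is never scored
]
loss = next_token_loss(toy_logits, [0, 2, 1])
```

The uniform guess costs log(3) and the confident correct guess costs almost nothing, so the mean sits roughly halfway between. Nothing in the loss knows what the text is about, which is the sense in which the model "learns nothing it's told to learn."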

The artifact at the end of pretraining is called the base model. Three things to internalize about base models, because each one matters for downstream stages:

  • The base model has no chat structure. Ask a base model “What is the capital of France?” and a plausible completion is “What is the capital of Germany? What is the capital of Italy?” — because the most common context in which that sentence appears is a list of similar questions, not an answer. The chat template (“user: … / assistant: …”) that turns it into a conversational agent doesn’t exist yet; the model has never been trained to recognize that distinction.
  • The base model has all the knowledge it will ever have. Every fact, every association, every learned style is installed during pretraining. Later stages can reshape how that knowledge surfaces, but they cannot add new facts at a meaningful rate — which is the underlying reason fine-tuning fails for knowledge injection while RAG succeeds. The pretraining corpus is the knowledge frontier; everything after is steering.
  • The base model is dangerously unaligned by default. Ask a base model how to do something harmful and the completion will be drawn from the most statistically plausible continuation of the prompt, which is often a competent walkthrough — because the internet contains both the question and the answer. The refusal behavior you take for granted on production models is entirely a product of the later stages.

The pretraining compute budget for a frontier 2025–2026 model is in the $50M–$500M range, dominated by GPU-hours on H100/H200/B200-class accelerators over several months. The post-training budget (the next two stages combined) is typically 1–3% of pretraining cost, but it’s where the model becomes a product. This asymmetry — pretrain is 99% of the cost, post-train is 99% of the user-visible behavior — is the cleanest argument that whoever owns the post-training stack owns the product, even if they don’t own the base model.

Stage 2: Supervised Fine-Tuning (SFT)

The objective shifts from “predict the next token of arbitrary text” to “predict the next token of an assistant response given a user prompt.” The data is curated (prompt, demonstration) pairs, thousands to hundreds of thousands of them, written by humans or distilled from a stronger model. The loss is masked cross-entropy: the model is graded only on the tokens of the assistant turn, not on the user’s input. The optimizer is plain AdamW, the learning rate is typically about an order of magnitude below the peak pretraining rate (because you’re nudging an already-trained model, not training from scratch), and the run typically completes in hours to days on far fewer GPUs than pretraining.
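The masking is easier to see in code than in prose. The sketch below assumes you already have the log-probability the model assigned to each target token; the helper names and toy numbers are hypothetical, and -100 is the label value that PyTorch's cross-entropy skips by default.

```python
import math

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default

def build_labels(token_ids, is_assistant):
    """Copy token ids as labels, blanking out every position outside the assistant turn."""
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(token_ids, is_assistant)]

def masked_loss(token_log_probs, labels):
    """Average negative log-likelihood over the unmasked positions only."""
    kept = [-lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Toy sequence: three user-prompt tokens followed by two assistant tokens.
token_ids = [11, 12, 13, 21, 22]
is_assistant = [False, False, False, True, True]
# Log-probability the model gave each target token (made-up values).
token_log_probs = [math.log(0.1)] * 3 + [math.log(0.5), math.log(0.25)]

labels = build_labels(token_ids, is_assistant)
sft_loss = masked_loss(token_log_probs, labels)                  # assistant tokens only
unmasked_loss = -sum(token_log_probs) / len(token_log_probs)     # what you'd get without the mask
```

In this toy case the unmasked loss is dominated by the poorly predicted prompt tokens, which is exactly the gradient you don't want to spend during SFT.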

What SFT installs:

  • Turn structure. The model learns that <|user|> ... <|assistant|> ... is the canonical envelope. After SFT, the model produces an assistant turn when prompted with a user turn — without SFT, it’d produce another user turn.
  • Instruction following at the surface level. “Summarize this in three bullets” produces three bullets, not a continuation that pretends the instruction was part of the text. This is the single most visible change between a base model and an SFT’d model.
  • Format adherence. JSON when asked for JSON, markdown when asked for markdown, refusal phrasing when the demonstrations included refusals.
  • The base of the alignment signal. The demonstrations are the first place the model sees what a “good answer” looks like according to the labelers. The next stage refines this with comparisons, but SFT installs the demonstrations themselves.
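The turn envelope from the first bullet can be made concrete with a few lines of string formatting. This is a minimal sketch: the <|user|> and <|assistant|> tags are the illustrative ones used above, render_chat is a hypothetical helper, and every real model ships its own template with its own special tokens.

```python
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"

def render_chat(messages, add_generation_prompt=True):
    """Flatten a list of {"role", "content"} messages into one prompt string."""
    parts = []
    for message in messages:
        tag = USER_TAG if message["role"] == "user" else ASSISTANT_TAG
        parts.append(f"{tag}\n{message['content']}")
    if add_generation_prompt:
        # A trailing assistant tag is the cue to write the assistant turn next.
        parts.append(ASSISTANT_TAG)
    return "\n".join(parts)

prompt_text = render_chat([{"role": "user", "content": "What is the capital of France?"}])
```

An SFT'd model has seen thousands of strings shaped like prompt_text followed by an answer; a base model has not, which is why it treats the same question as the start of a list.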

What SFT does not install:

  • Calibration between candidate responses. SFT teaches the model to imitate one good response per prompt; it does not teach it to rank responses or prefer one over another. Two demonstrations on the same prompt — both reasonable but differently styled — are both equally encouraged.
  • Robust refusals under adversarial pressure. A demonstration of “I can’t help with that” against the canonical prompt for a harmful request doesn’t generalize to a paraphrased version of the same request. The refusal is brittle until preference optimization sharpens it.
  • The taste step. “Terse vs verbose,” “explanatory vs blunt,” “asks clarifying questions vs charges ahead” — these are stylistic axes the demonstrations bake in implicitly, but they’re noisy, depend on which labeler wrote which example, and don’t generalize uniformly across the prompt distribution.

The minimal SFT loop in TRL looks like this:

python
# pip install "trl>=0.22" "transformers>=4.46" "datasets>=3" accelerate peft
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

MODEL = "Qwen/Qwen2.5-0.5B"  # small enough to fine-tune on a single GPU
model = AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# UltraChat is a public SFT dataset of multi-turn assistant dialogues
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:5000]")

config = SFTConfig(
    output_dir="./sft-qwen",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,           # roughly 10× below typical peak pretraining LRs
    logging_steps=20,
    max_length=1024,
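    completion_only_loss=True,    # loss on completion tokens only; documented for prompt-completion data.
                                  # For "messages" data like UltraChat, TRL's switch is assistant_only_loss.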
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=ds,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model("./sft-qwen")

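The completion_only_loss=True flag is the load-bearing detail. Without loss masking, the model is graded on its prediction of the user’s prompt too, which dilutes the gradient and slows convergence; you want the loss concentrated on the assistant turn, because that’s the part you’re trying to shape. One caveat worth checking against your TRL version: completion_only_loss is documented for prompt-completion datasets, while conversational “messages” datasets such as UltraChat are masked with assistant_only_loss=True, which in turn needs a chat template that marks the assistant spans.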

For teams that don’t run their own GPUs, the equivalent path through a managed fine-tuning API in TypeScript:

typescript
// npm install openai
// Other managed providers (e.g. Together, Fireworks) offer a broadly similar flow;
// here we use OpenAI's fine-tuning API as the most documented example.
import OpenAI from "openai";
import * as fs from "node:fs";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Upload a JSONL of {"messages": [{role,content}, ...]} examples.
const file = await client.files.create({
  file: fs.createReadStream("./sft-data.jsonl"),
  purpose: "fine-tune",
});

const job = await client.fineTuning.jobs.create({
  training_file: file.id,
  model: "gpt-4.1-mini-2025-04-14",
  method: { type: "supervised", supervised: { hyperparameters: { n_epochs: 2 } } },
});

console.log("created job", job.id, "status:", job.status);
// Poll job.status until "succeeded"; the result is job.fine_tuned_model

The two are functionally equivalent — both compute masked cross-entropy on demonstration data — but the managed API hides the infrastructure (the GPU choice, the batch size, the optimizer) behind a job submission. The TRL path is what you reach for when you want control, custom data preprocessing, or LoRA adapters; the managed API is what you reach for when the workload is shaped enough that the defaults work and you don’t want to operate the cluster.
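Either path starts from the same raw material: a file of demonstrations. Below is a small sketch of producing the {"messages": [...]} JSONL shape mentioned in the upload comment above. The helper names and the two example pairs are hypothetical, and you should check your provider's current format requirements before uploading.

```python
import json

def to_chat_example(prompt, demonstration, system=None):
    """Wrap one (prompt, demonstration) pair in the chat-style "messages" shape."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    messages.append({"role": "assistant", "content": demonstration})
    return {"messages": messages}

def to_jsonl(pairs):
    """Serialize pairs as JSON Lines: one JSON object per line."""
    return "\n".join(
        json.dumps(to_chat_example(prompt, demo), ensure_ascii=False)
        for prompt, demo in pairs
    )

# Hypothetical demonstrations; real SFT sets have thousands of these.
pairs = [
    ("Summarize the release notes in three bullets.", "- Faster startup\n- New export API\n- Two bug fixes"),
    ("Return a JSON object with a name field.", '{"name": "example"}'),
]
jsonl_text = to_jsonl(pairs)
# To produce the file the upload step reads:
# open("sft-data.jsonl", "w", encoding="utf-8").write(jsonl_text + "\n")
```

json.dumps escapes the newlines inside each demonstration, so every example stays on a single line, which is what the JSONL format requires.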

Stage 3: Preference Optimization

The third stage is where the model learns which of two plausible responses a human would prefer. The data shape is a triple — (prompt, chosen, rejected) — and the objective is to push up the model’s probability of the chosen response relative to the rejected one. The mechanism for doing that has gone through three eras.

Era 1 (2022–2023): PPO-style RLHF

The original InstructGPT pipeline split preference optimization into two sub-stages. First, train a reward model (typically a copy of the language model with a scalar output head in place of the token-prediction head) to score a response as a single number, by minimizing a Bradley-Terry-style loss on the preference pairs: the chosen response should score higher than the rejected one. Second, treat the language model as a policy and the reward model as the environment, and run proximal policy optimization (PPO) to push the policy toward higher-reward outputs. A KL-divergence penalty against the SFT’d model keeps the policy from drifting too far; this “reference model” anchor limits reward hacking and mode collapse.
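Both sub-stages reduce to short formulas. The sketch below shows the Bradley-Terry pairwise loss for the reward model and a KL-penalized reward of the kind the PPO step optimizes. Function names, the kl_coef default, and the toy scores are assumptions for illustration, not values from the InstructGPT paper.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bradley_terry_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected): small when chosen outranks rejected."""
    return softplus(-(score_chosen - score_rejected))

def kl_shaped_reward(reward, policy_logprob, reference_logprob, kl_coef=0.1):
    """Reward-model score minus a penalty for drifting away from the SFT reference.

    policy_logprob - reference_logprob is a single-sample estimate of the KL term.
    """
    return reward - kl_coef * (policy_logprob - reference_logprob)

correct_order = bradley_terry_loss(2.0, 0.0)  # reward model already ranks the pair correctly
wrong_order = bradley_terry_loss(0.0, 2.0)    # reward model ranks the pair backwards
shaped = kl_shaped_reward(reward=1.0, policy_logprob=-5.0, reference_logprob=-7.0)
```

The penalty term is why a response the policy has made far more likely than the reference would needs a correspondingly higher reward-model score to be worth it.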

PPO works, but the pipeline has six moving parts: the policy model, the reference model (a frozen copy of the SFT model), the reward model, the value model (PPO’s critic), the rollout sampler, and the optimizer. Four of those are full models that each need a forward pass at training time, the hyperparameters are notoriously fiddly, and the reward model becomes the brittle point: once the policy learns to exploit a gap in the reward model, the whole pipeline collapses. The compute cost is roughly 5–10× SFT for comparable data volume.

Era 2 (2024–2025): DPO and direct preference methods

Direct Preference Optimization (Rafailov et al., 2023) collapsed the reward model and the PPO loop into a single supervised loss. The insight: under the Bradley-Terry assumption, the optimal RLHF policy has a closed-form expression in terms of the reference model, and you can derive a loss that directly fits that expression on preference pairs — no reward model, no PPO sampling loop, no value model. The training loop looks almost identical to SFT, with a custom loss function that uses two forward passes (the trainable model and the frozen reference) per batch.
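The loss itself fits in a few lines. This sketch takes the summed log-probabilities of each response under the trainable policy and the frozen reference and returns the DPO loss for one preference pair, following the form in Rafailov et al. The function names and toy log-probabilities are illustrative assumptions.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the summed log-probability of a full response.
    Loss = -log sigmoid(beta * (chosen_shift - rejected_shift)).
    """
    chosen_shift = policy_chosen_lp - ref_chosen_lp        # how far the policy moved toward chosen
    rejected_shift = policy_rejected_lp - ref_rejected_lp  # how far it moved toward rejected
    return softplus(-beta * (chosen_shift - rejected_shift))

# Before any update the policy equals the reference, so the margin is zero.
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After an update that raises the chosen response and lowers the rejected one.
after_update = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

Only the shifts relative to the reference matter, not the absolute log-probabilities. That is how the frozen reference acts as the anchor without a separate KL term in the code.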

The 2026 production pattern across well-run teams: SFT first, DPO second, PPO-RLHF only when you have a specific reason to need it. Llama 3’s post-training used DPO as its preference-optimization step (alongside SFT and rejection sampling), and most open-weight model releases since have followed the same shape. The two main DPO variants worth knowing: IPO (Identity Preference Optimization), which adds a regularizer that keeps DPO from over-fitting on confident preferences, and KTO (Kahneman-Tversky Optimization), which drops the paired-comparison requirement and works on standalone thumbs-up/thumbs-down signals, a better fit for the feedback queues you collect in production. The variant zoo, the derivation, and the failure modes specific to DPO are unpacked in DPO and Modern Alignment.
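The practical appeal of KTO is the data shape. Here is a sketch of turning a production feedback log into unpaired records. The event keys (prompt, response, thumb) are hypothetical, and the prompt/completion/label layout is my understanding of the unpaired-preference format TRL's KTO trainer consumes, so treat it as an assumption to verify against the version you install.

```python
def feedback_to_kto_records(events):
    """Convert thumbs-up/down events into unpaired preference records.

    Each event is a dict with hypothetical keys "prompt", "response", and
    "thumb" ("up" or "down"). Events without an explicit signal are dropped.
    """
    records = []
    for event in events:
        if event.get("thumb") not in ("up", "down"):
            continue  # no usable signal on this response
        records.append({
            "prompt": event["prompt"],
            "completion": event["response"],
            "label": event["thumb"] == "up",  # True = desirable, False = undesirable
        })
    return records

# Made-up feedback log for illustration.
events = [
    {"prompt": "Reset my password", "response": "Here are the steps...", "thumb": "up"},
    {"prompt": "Cancel my plan", "response": "I cannot help with that.", "thumb": "down"},
    {"prompt": "What's my balance?", "response": "Your balance is shown in Settings.", "thumb": None},
]
kto_records = feedback_to_kto_records(events)
```

No pairing step is needed: each response stands alone with its label, which is why this format maps so directly onto a feedback queue.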

The minimal DPO loop in TRL:

python
# pip install "trl>=0.22"
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

MODEL = "./sft-qwen"  # the output of the SFT stage above
model = AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# UltraFeedback is a public preference dataset of (prompt, chosen, rejected) triples
ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:5000]")

config = DPOConfig(
    output_dir="./dpo-qwen",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=5e-7,           # DPO uses lower LRs than SFT
    beta=0.1,                     # KL strength; 0.1 is the canonical default
    logging_steps=20,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=ds,
    processing_class=tokenizer,
    # ref_model omitted: TRL builds a frozen copy of `model` to serve as the reference
)
trainer.train()
trainer.save_model("./dpo-qwen")

The beta hyperparameter controls how tightly the trained policy is anchored to the reference model. Low beta (0.01–0.05) lets the model drift further from the reference and chase the preference signal harder, at the cost of degraded fluency and the risk of mode collapse. High beta (0.3–1.0) keeps the model close to the reference and tames the drift but learns less. The canonical default of 0.1 is a reasonable starting point; tune from there.

Era 3 (2025–2026): RLAIF and RL with verifiable rewards

Two divergent directions emerged once DPO’s economic advantages were clear.

RLAIF (Reinforcement Learning from AI Feedback) replaces the human preference labels with labels produced by an LLM. Anthropic’s Constitutional AI (Bai et al., 2022) was the first large-scale documented use of this pattern: a model evaluates its own responses against a written constitution and labels which response better adheres to the principles. The labels then feed into either the reward model (CAI’s original setup) or directly into a DPO-style trainer. RLAIF scales — you can generate millions of preference pairs at the cost of inference, not the cost of human labelers — and it’s the only economically viable path to the volume of preference data that modern post-training runs require. The trade-off is that the labeler model’s biases become the trained model’s biases; “the constitution” is the formal specification of what biases you’re propagating, and Anthropic published a major revision in January 2026 that shifted from rule-based to reason-based principles.
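The labeling loop is simple enough to sketch. Below, judge stands in for an LLM call that reads the constitution and returns which of two responses it prefers; toy_judge is a deliberately crude placeholder so the example runs offline. Asking in both orders and dropping inconsistent verdicts is a common mitigation for position bias that I've added here; it is not something the CAI paper prescribes.

```python
def label_pair(prompt, response_a, response_b, judge):
    """Ask the judge in both orders; keep the pair only if the verdicts agree.

    judge(prompt, first, second) must return "A" if it prefers `first`
    and "B" if it prefers `second`.
    """
    forward = judge(prompt, response_a, response_b)
    swapped = judge(prompt, response_b, response_a)
    if forward == "A" and swapped == "B":
        chosen, rejected = response_a, response_b
    elif forward == "B" and swapped == "A":
        chosen, rejected = response_b, response_a
    else:
        return None  # the judge contradicted itself, so the label is not trustworthy
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def toy_judge(prompt, first, second):
    """Placeholder for an LLM judge: penalizes the word "obviously".

    On ties it always answers "A", which is a position bias by construction.
    """
    first_bad = "obviously" in first.lower()
    second_bad = "obviously" in second.lower()
    if first_bad and not second_bad:
        return "B"
    return "A"

question = "What is the capital of France?"
labeled = label_pair(question, "Obviously it's Paris.", "The capital is Paris.", toy_judge)
dropped = label_pair(question, "Paris.", "The capital is Paris.", toy_judge)
```

The output dicts have the same (prompt, chosen, rejected) shape as human-labeled preference data, so they can feed a reward model or a DPO-style trainer unchanged. Whatever the judge gets wrong, the trained model inherits.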

RLVR (Reinforcement Learning with Verifiable Rewards) replaces the learned reward model with a deterministic reward signal: a unit test that either passes or fails, a math problem with a known answer, a format check that either validates or doesn’t. DeepSeek-R1 (DeepSeek-AI, 2025) demonstrated that RLVR with GRPO, a PPO variant that drops the value/critic model and computes advantage relative to a group of sampled responses, can produce strong reasoning capabilities. Its R1-Zero variant skips SFT entirely and trains directly with RLVR from the base model; the released R1 adds a small cold-start SFT stage before RL, largely to fix R1-Zero’s readability problems. The technique is specific to domains where a verifier exists (math, code, structured outputs), but inside those domains it’s the current state of the art and the source of the “reasoning model” generation that emerged in 2025.
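Two pieces make this work: a verifier and a group-relative advantage. The sketch below uses an exact-match check on a hypothetical "Answer:" marker as the verifier, then normalizes rewards within a group of samples for the same prompt. The marker convention, function names, and the choice of population standard deviation are assumptions; real implementations differ in these details.

```python
import statistics

def exact_match_reward(completion, expected_answer, marker="Answer:"):
    """Return 1.0 if the text after the last marker equals the expected answer, else 0.0."""
    if marker not in completion:
        return 0.0  # no parseable answer counts as a failure
    answer = completion.rsplit(marker, 1)[1].strip()
    return 1.0 if answer == expected_answer else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Score each sample against its own group: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt whose known answer is "42" (made-up text).
completions = [
    "6 times 7 is 42. Answer: 42",
    "6 times 7 is 41. Answer: 41",
    "I think it is 42",
    "Multiply to get the result. Answer: 42",
]
rewards = [exact_match_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
```

The group mean plays the role that the critic plays in PPO: it is the baseline against which each sample is judged. When every sample in a group gets the same reward, all advantages are zero and that prompt contributes no gradient.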

The three eras don’t replace each other so much as stack: a 2026 frontier model’s post-training pipeline is often SFT → RLAIF/DPO for general alignment → RLVR for reasoning capabilities, in that order, with each stage using the previous stage’s output as its reference model.

Trade-offs, failure modes, gotchas

The alignment tax. Every post-training stage moves the model away from the pretraining distribution. The model that was good at completing arbitrary text becomes worse at it. This is the alignment tax — a measurable drop in raw benchmark performance traded for a large gain in instruction-following and refusal calibration. The tax is the strongest argument against over-aggressive RLHF; tune beta (in DPO) or the KL penalty (in PPO) too low and you can find a hyperparameter setting where the model is impressively aligned on the preference distribution and broken on everything else.

Reward hacking and mode collapse. PPO-RLHF’s most consistent failure mode: the policy finds a way to maximize reward-model output without producing the kind of response the reward model was supposed to reward. Classic symptom: the policy converges to a single high-reward template and produces it for every prompt regardless of context. DPO mitigates this by removing the reward model from the loop, but it has its own version — the policy can sharpen on the preference distribution and lose entropy on responses the preference data didn’t cover. The KL anchor against the reference model is the main mitigation in both regimes.

Distribution shift between stages. The preference data used for DPO/RLHF was generated against responses the model produced before preference optimization. As the model updates, its response distribution shifts; the preference data becomes off-policy. The mitigation is iterative DPO — generate fresh on-policy preference pairs from the current model, label them, retrain — but the labeling-and-retraining cycle is the bottleneck.

The labeler-is-the-model problem. With RLAIF, the labeler is itself a model with its own biases, blind spots, and failure modes. A constitution that says “be helpful” doesn’t guarantee that the labeler can reliably distinguish helpful from sycophantic responses; the trained model inherits the labeler’s confusions. This is the underlying reason CAI-style approaches publish their constitutions explicitly — the constitution is the spec the trained model is being shaped against, and a vague constitution produces a vague model.

Catastrophic forgetting. Post-training stages can degrade capabilities that pretraining installed, especially when the SFT or DPO dataset under-represents whole domains. A model fine-tuned heavily on customer-support dialogue can lose its math capability. The mitigation is data mixing — keep a fraction of pretraining-style data in every fine-tuning batch, or run continual pretraining alongside fine-tuning to keep the substrate alive.
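Data mixing is mostly bookkeeping. This sketch blends a task dataset with "replay" examples drawn from general-purpose data so that a target fraction of the final mix is replay. The function name, the 10% default, and the toy records are assumptions; the right fraction is an empirical question for your model and domain.

```python
import random

def mix_with_replay(task_examples, replay_examples, replay_fraction=0.1, seed=0):
    """Return a shuffled list in which about replay_fraction of items are replay data.

    Mixing happens at the dataset level, so each batch contains roughly that
    fraction in expectation rather than exactly.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_task + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(task_examples) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(replay_examples))
    mixed = list(task_examples) + rng.sample(list(replay_examples), n_replay)
    rng.shuffle(mixed)
    return mixed

# Toy datasets: 90 customer-support examples and a pool of 100 general examples.
task = [("support", i) for i in range(90)]
replay = [("general", i) for i in range(100)]
mixed = mix_with_replay(task, replay, replay_fraction=0.1)
```

With 90 task examples and a 10% target, the mix ends up with 10 replay examples out of 100. A fixed seed keeps the sample reproducible across runs, which matters when you're comparing fine-tunes.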

The “instruct” model isn’t the base model with a hat. The instruction-tuned model you call via API is mechanically different from the base model: its weights have moved, sometimes significantly. The base model is occasionally released alongside (“Llama-3.1-8B” vs “Llama-3.1-8B-Instruct”) and the differences are non-trivial. If you’re fine-tuning, start from the base model when you want to replace the alignment (rare) and from the instruct model when you want to extend it (the usual case).

Further reading from the field

  • DPO and Modern Alignment — the deep dive on the third stage. The derivation, the IPO/KTO/ORPO/SimPO variant zoo, the iterative DPO production pattern, and the length-bias / chosen-probability-drop failure modes you’ll hit in practice.
  • LoRA and Parameter-Efficient Fine-Tuning — the parameterization the SFT and DPO stages from this article actually run on top of in production. Full-parameter fine-tuning is no longer a serious option for most teams; PEFT is the substrate.
  • Fine-Tuning vs RAG: When to Choose Which — the application-side decision tree that builds on what this article taught about what SFT and DPO actually change. The knowledge-injection trap is a direct consequence of how the three-stage pipeline allocates capabilities across stages.
  • Human-in-the-Loop Feedback Loops — the production data-flywheel that produces the preference pairs the third stage trains on. The labeling discipline in that article is the upstream half of the RLAIF and DPO loops covered here.