Fine-Tuning vs RAG: When to Choose Which

How to choose fine-tuning, retrieval, or both based on behavior, knowledge freshness, cost, and evaluation.

Jatin Bansal@blog:~/ai-engineering$ open finetuning-vs-rag

Fine-tuning changes model behavior stored in weights. Retrieval supplies external evidence at inference time. Stable requirements such as response format or domain style may justify fine-tuning; frequently changing documentation and facts usually belong in retrieval. Many systems use both because the two components have different update cycles.

When fine-tuning is the right tool

Pull the criteria together and the affirmative case for fine-tuning has four shapes.

1. Stable behavior that prompting can’t reliably enforce. You have a target output format, voice, refusal pattern, or chain-of-thought style that you can describe in a rubric and that an evaluator can grade reliably, but prompting alone, even with strong examples, reaches only ~80% adherence. Classic cases are a strict JSON schema with deeply nested optional fields, a customer-service voice that has to hold a specific brand register across long answers, and a refusal policy that has to be uniform across many phrasings of out-of-scope queries. A few thousand labeled examples can fine-tune the behavior to 99%+, and the prompt no longer needs the multi-shot demonstration block. That incidentally cuts input tokens by 60–80% per request, which at high volume pays back the fine-tune cost in inference savings within weeks.

2. Latency-sensitive paths where retrieval is too expensive. The retrieval round-trip (embed the query, ANN lookup, hydrate the chunks, append to the prompt) typically takes 50–200ms on a tight vector store, and more if you have a reranker in the cascade. For sub-100ms response paths (voice assistants, real-time UI affordances, latency-critical user flows), retrieval doesn’t fit the budget; fine-tuning the model to absorb the narrow knowledge that path needs avoids the lookup entirely. The narrowness matters: this only works when the knowledge surface for the latency-sensitive path is small enough for the model to absorb, which means a few hundred entities, not the full corpus.

3. High volume at the cost frontier. A workload that ships 100M+ requests/month is the right shape for fine-tuning a smaller open-weight model, such as Llama 3.3 or Qwen 2.5, and self-hosting it on dedicated GPUs. The break-even is usually around the $200k/year mark, where the savings over per-token API pricing start to cover the cost of operating your own model-serving infrastructure. The fine-tune is part of the migration plan: you can’t just swap in the smaller model cold. It needs to be fine-tuned on the workload’s task distribution to close the quality gap against the frontier model it’s replacing.

4. Distillation from a frontier teacher to a deployable student. Generate the training corpus by running a small share of your traffic through the frontier model (Sonnet 4.6, GPT-5.5, Opus 4.7), capturing input-output pairs, and fine-tuning your deployable target (Haiku 4.5, GPT-5.4-mini, or an open-weight model) on that corpus. This is the production pattern most teams converge on once the data flywheel is running: the frontier model is the labeler, the cheaper fine-tuned model is the deployment target, and the LLM-as-judge infrastructure validates that the student hasn’t drifted off the teacher’s quality envelope on the workload’s evals. Distillation is structurally different from fine-tuning for behavior change (the goal is capability transfer, not behavior installation), but the mechanics are the same.
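
The payback claim in the first shape is plain arithmetic, and it is worth running against your own numbers before committing. A minimal sketch follows; every figure passed in (fine-tune cost, request volume, prompt sizes, token price) is an illustrative assumption, not a measurement or a published price.

```python
def payback_months(
    finetune_cost_usd: float,
    requests_per_month: int,
    prompt_tokens_before: int,
    prompt_tokens_after: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Months until input-token savings cover a one-off fine-tune cost."""
    saved_tokens = (prompt_tokens_before - prompt_tokens_after) * requests_per_month
    monthly_saving = saved_tokens / 1_000_000 * usd_per_million_input_tokens
    if monthly_saving <= 0:
        # No token savings: the fine-tune never pays back on this axis alone.
        return float("inf")
    return finetune_cost_usd / monthly_saving


# Hypothetical workload: a 2,000-token few-shot prompt shrinks to 500 tokens
# (a 75% cut), at 10M requests/month, $1 per million input tokens,
# and a $30k all-in fine-tune cost.
months = payback_months(30_000, 10_000_000, 2_000, 500, 1.0)
print(f"payback in {months:.1f} months")
```

The function ignores output tokens, serving overhead, and retraining cost, so treat the result as a lower bound on the payback period rather than a forecast.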

Outside these four shapes, fine-tuning is usually the wrong tool, even when the problem feels like it should be a fine-tuning problem.

When fine-tuning fails

The knowledge-injection trap is the most important failure mode to internalize. Two pieces of recent research land on the same conclusion from different angles.

Mecklenburg et al. (2024), “Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning”, compared supervised fine-tuning to RAG on a controlled benchmark of post-training facts. Document-format training data (raw paragraphs) produced near-zero retention. QA-format training data (synthetic question-answer pairs over the same facts) performed substantially better but still underperformed RAG by a clear margin. The mechanism: the gradient signal in document training is dominated by language modeling loss across the whole text, not by the specific entity-relation-entity tuple you want stored; QA format concentrates the signal on the answer span, which works better but still falls short of explicit retrieval.
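
To make the two training formats concrete, here is a minimal sketch that renders the same facts once as a raw paragraph and once as question-answer pairs. The `Fact` tuple, the sentence and question templates, and the chat-style record layout are illustrative assumptions, not the format used in the paper.

```python
import json
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str   # entity the fact is about
    relation: str  # attribute, phrased as a noun
    obj: str       # value to be recalled


def to_document(facts: list[Fact]) -> str:
    """Document format: one paragraph, with loss spread over every token."""
    return " ".join(f"The {f.relation} of {f.subject} is {f.obj}." for f in facts)


def to_qa_examples(facts: list[Fact]) -> list[dict]:
    """QA format: one example per fact, with the answer span as the target."""
    return [
        {
            "messages": [
                {"role": "user", "content": f"What is the {f.relation} of {f.subject}?"},
                {"role": "assistant", "content": f.obj},
            ]
        }
        for f in facts
    ]


# Made-up facts for illustration only.
facts = [
    Fact("the ACME Pro plan", "refund window", "30 days"),
    Fact("the ACME Pro plan", "seat limit", "25 users"),
]
document = to_document(facts)
qa_lines = [json.dumps(example) for example in to_qa_examples(facts)]  # JSONL rows
```

In the QA rows the only supervised text is the short answer, which is the signal concentration the paragraph above describes; in the document row the same two values are a small share of the tokens.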

Soudani, Kanoulas, and Hasibi (2024), “Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge” made the long-tail story sharper: RAG outperforms supervised fine-tuning by a large margin for less-popular entities, and the gap widens as popularity decreases. Fine-tuning compresses the frequent patterns well and forgets the long tail; retrieval treats every entity equivalently and pulls the right tuple regardless of training-corpus frequency. The honest reading: if your application’s value is in handling the long tail accurately, fine-tuning is the wrong tool.

Companion work in the same vein found that unsupervised fine-tuning on documents (the cheapest fine-tuning shape) provided “only limited gains over base models”, barely outperforming the un-tuned model on factual recall. Supervised fine-tuning on task-specific data did better, but the conclusion remained: for inserting new knowledge into a model, RAG dominates the win-rate across model sizes and knowledge popularity bands.

Three more failure modes are worth naming explicitly.

The catastrophic-forgetting tax. A naively fine-tuned model loses general capability outside the fine-tuning distribution. The mitigation is LoRA/QLoRA adapters: low-rank updates that train only a small fraction of the parameters, so the base model’s behavior outside the adapter’s task is largely preserved. The cost: adapter loading or merging at inference, rank selection (typically 8–64 for behavior tasks), and a per-task adapter artifact you have to deploy alongside the base. Full-weight fine-tuning is reserved for cases where the behavior is so pervasive it justifies a full model fork; for most production behavior tweaks, LoRA is the default.
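
The “small fraction” is easy to quantify. For a weight matrix of shape `d_out × d_in`, a rank-`r` adapter trains two low-rank factors with `r * (d_in + d_out)` parameters in total. A minimal sketch of that arithmetic follows; the 4096×4096 projection size is an illustrative assumption, not a specific model’s layout.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of one weight matrix's parameters that a rank-`rank` adapter trains."""
    full_params = d_out * d_in               # frozen base weight W
    adapter_params = rank * (d_in + d_out)   # factors A (rank x d_in) and B (d_out x rank)
    return adapter_params / full_params


# Sweep the rank range quoted above for a hypothetical 4096 x 4096 projection.
for rank in (8, 16, 64):
    fraction = lora_trainable_fraction(4096, 4096, rank)
    print(f"rank {rank:>2}: {fraction:.4f} of the matrix's parameters are trainable")
```

Even at the top of the quoted rank range the adapter is a few percent of a single matrix, and adapters are usually attached to only some of the model’s matrices, so the whole-model share is smaller still.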

The eval-gap surprise. A team fine-tunes, sees a 5–10% improvement on the offline eval, ships it, and watches production quality drop on slices the eval didn’t cover. The fine-tune optimized for the eval distribution and silently regressed outside it. The mitigation is the eval-driven-development discipline: a properly stratified golden set with explicit coverage of the long-tail slices the fine-tune might harm, and a drift detector that catches the regression in production if the eval missed it. Fine-tuning without strong evals is optimizing for the eval metric, not improving the workload.
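
One way to make the stratified check mechanical is a per-slice gate: compare the fine-tuned candidate to the baseline slice by slice and block the release if any slice drops, even when the average improves. The sketch below assumes you already have per-slice scores in the range 0 to 1; the slice names, scores, and tolerance are made up for illustration.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return {slice: score delta} for slices where the candidate is worse
    than the baseline by more than `tolerance`."""
    missing = set(baseline) - set(candidate)
    if missing:
        # A slice with no candidate score is a coverage hole, not a pass.
        raise ValueError(f"candidate has no score for slices: {sorted(missing)}")
    regressions = {}
    for slice_name, base_score in baseline.items():
        delta = round(candidate[slice_name] - base_score, 4)
        if delta < -tolerance:
            regressions[slice_name] = delta
    return regressions


# Hypothetical scores: the unweighted average improves, the long tail regresses.
baseline = {"head_queries": 0.90, "long_tail": 0.82, "out_of_scope": 0.95}
finetuned = {"head_queries": 0.98, "long_tail": 0.78, "out_of_scope": 0.96}

regressions = find_regressions(baseline, finetuned)
if regressions:
    print("blocked, regressed slices:", regressions)
```

Gating on the worst slice rather than the mean is a design choice; it trades some false alarms for catching exactly the silent regressions described above.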

The lifecycle-ownership debt. The fine-tune has to be treated like a piece of code: it has a version, an owner, a CI pipeline, a deploy story, a rollback story, an eval suite, a re-train cadence, and a deprecation plan. Most teams that fine-tune in production are surprised by how much of this they have to build. The Hugging Face Hub and the Anthropic Console hide the build/serve plumbing; they don’t hide the ops cost of maintaining the artifact over a multi-year horizon. The honest cost model for a fine-tune includes the engineer-months of lifecycle work, not just the dollar cost of the training run.

The decision tree

The 2026 production default isn’t “fine-tune or RAG”; it’s a sequence that escalates in cost and complexity, with each step justified by an explicit failure of the previous one.

text

1. PROMPT ENGINEERING  →  Try first. Cheapest, fastest, most reversible.
   Failure mode that justifies escalation:
   - Quality plateau on evals you trust, despite system-prompt iteration,
     few-shot examples, and structured output coercion.
   - Token cost on the few-shot demonstration block becomes meaningful at scale.

2. RAG                  →  Add when the bottleneck is "the model doesn't know
                           our facts" or "the facts change too fast to bake in."
   Failure mode that justifies escalation:
   - Eval shows model still doesn't follow the application's format or voice,
     even with retrieved context, i.e. the gap is behavioral, not factual.
   - Retrieval latency dominates the latency budget on a path where it can't.
   - You've reached the long-tail floor where retrieval over your corpus is
     no longer recovering the precision you need.

3. FINE-TUNE (LoRA)     →  Add when the bottleneck is "the model doesn't behave
                           the way our application requires" or "we need
                           cheaper deployable artifacts for high-volume paths."
   Failure mode that justifies escalation:
   - Per-token cost on the frontier model is unsustainable and you've already
     routed and cached aggressively.
   - You need a self-hosted artifact for compliance, data-residency, or
     latency-floor reasons.

4. DISTILL              →  Add when you have a high-volume workload, a quality
                           bar set by a frontier model, and an eval suite that
                           can verify the student tracks the teacher.

This is essentially the order that Hamel Husain recommends: do as much prompt engineering as possible before you fine-tune, because the prompt is the cheapest tool and prompting well is also a great stress test of your eval suite. The 2026 consensus that ScalaCode, BigDataBoutique, and other recent guides converge on is also this sequence: prompt → RAG → fine-tune → distill. The order matters because each step is more reversible than the next, and each step’s right-or-wrong answer is detectable on an eval suite that’s been refined by all the previous steps.
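
The escalation rule can be sketched as a small function: start at prompt engineering and move one step up only while a failure signal for the current step has actually been observed. The signal names below are hypothetical labels for the failure modes listed in the tree, not part of any library.

```python
# Steps in escalation order, cheapest and most reversible first.
LADDER = ["prompt_engineering", "rag", "fine_tune_lora", "distill"]

# Failure signals that justify leaving each step (labels are illustrative).
ESCALATION_SIGNALS = {
    "prompt_engineering": {"quality_plateau", "few_shot_token_cost"},
    "rag": {
        "behavior_gap_with_context",
        "retrieval_latency_over_budget",
        "long_tail_precision_floor",
    },
    "fine_tune_lora": {"frontier_cost_unsustainable", "needs_self_hosted_artifact"},
}


def recommend(observed: set[str]) -> str:
    """Return the furthest step justified by the observed failure signals."""
    step = LADDER[0]
    for current, next_step in zip(LADDER, LADDER[1:]):
        if ESCALATION_SIGNALS[current] & observed:
            step = next_step  # the current step has demonstrably failed
        else:
            break  # no evidence against the current step, so stop here
    return step


print(recommend({"quality_plateau", "behavior_gap_with_context"}))
```

Note that a behavioral gap on its own does not skip ahead: with no evidence that prompting has plateaued, `recommend({"behavior_gap_with_context"})` stays at the first step, which mirrors the rule that each escalation needs an explicit failure of the previous one.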

Fine-tune behavior and retrieve facts

The production pattern that wins in 2026 is the hybrid one. The fine-tune installs the behavior, the retriever supplies the facts, the eval suite enforces both. Three concrete shapes.

Pattern A: Behavior-only LoRA + full RAG. Fine-tune a small open-weight model (Llama 3.3 70B, Qwen 2.5 32B) with a behavior-only LoRA covering output format, voice, refusal language, and tool-use shape. The training corpus is small: ~5k–20k examples generated from a frontier teacher (Sonnet 4.6) and validated by an LLM judge. At inference, retrieve from the live RAG corpus, build the prompt, and call the fine-tuned model. The fine-tune is stable across knowledge updates because it doesn’t encode any knowledge; the RAG handles the volatile half cleanly.

Pattern B: Distilled student + thin behavioral fine-tune. This is for a high-volume workload where economics demand a cheap model. Run a sample of traffic through the frontier teacher, capture (input, output) pairs, and fine-tune the student (Haiku-class or open-weight) on the corpus. The student picks up both the behavior and a meaningful chunk of the teacher’s reasoning quality on the workload’s distribution. Layer RAG on top to keep the knowledge surface live. The hardest engineering problem here is the eval: proving that the student is “good enough” requires a judge or rubric that can measure the workload’s specific quality dimension, which is the LLM-as-judge and eval-driven-development story applied in production.
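
The capture step in Patterns A and B can be sketched as a sampling loop with a judge filter. In the sketch below, `teacher` and `judge` are hypothetical callables standing in for a frontier-model call and an LLM-as-judge scorer, and the sample rate, score threshold, and record layout are assumptions for illustration.

```python
import json
import random
from typing import Callable, Iterable


def build_distillation_corpus(
    requests: Iterable[str],
    teacher: Callable[[str], str],       # hypothetical: calls the frontier model
    judge: Callable[[str, str], float],  # hypothetical: scores (input, output) in [0, 1]
    sample_rate: float = 0.05,
    min_score: float = 0.8,
    seed: int = 0,
) -> list[str]:
    """Sample traffic, label it with the teacher, keep pairs the judge accepts.

    Returns JSONL rows ready to write to a training file.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    rows = []
    for text in requests:
        if rng.random() >= sample_rate:
            continue  # not sampled: this request never reaches the teacher
        answer = teacher(text)
        if judge(text, answer) < min_score:
            continue  # drop low-quality labels instead of training on them
        rows.append(json.dumps({"input": text, "output": answer}))
    return rows


# Stand-in callables so the sketch runs without any API access.
def fake_teacher(text: str) -> str:
    return f"[teacher answer to: {text}]"


def fake_judge(text: str, answer: str) -> float:
    return 0.0 if "spam" in text else 1.0


corpus = build_distillation_corpus(
    ["reset my password", "spam spam", "cancel my plan"],
    fake_teacher,
    fake_judge,
    sample_rate=1.0,  # keep everything in this tiny demo
)
```

Filtering at capture time keeps the student from inheriting the teacher’s bad outputs, but the judge then becomes part of the training pipeline and needs its own validation.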

Pattern C: Frontier model with no fine-tune. The default for early-stage products. Prompt engineering plus RAG plus prompt caching plus model routing usually gets you 80% of the way to where fine-tuning would land, at a tenth of the engineering cost. The honest test: if you can’t articulate which specific behavior the fine-tune is going to change, you’re not ready to fine-tune.

Code: a hand-rolled decision harness in Python

The skeleton below implements the decision logic at the boundary between the application and the model. Based on a request classifier, it picks one of three routes per request: the fine-tuned student without retrieval, a mid-tier model with RAG, or the frontier model with RAG. The pattern composes with the model-routing primitive from the previous article. Dependencies: the Anthropic SDK, psycopg, and a Postgres database with pgvector.

python

# pip install anthropic "psycopg[binary]"
# Requires a Postgres database with the pgvector extension enabled.
import os
from dataclasses import dataclass
from typing import Literal

from anthropic import Anthropic
import psycopg

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

@dataclass
class Route:
    model: str               # the deployable model for this request
    rag: bool                # whether to retrieve context
    use_finetune: bool       # whether to call the fine-tuned variant


def classify_request(text: str) -> Literal["routine", "domain_heavy", "novel"]:
    """Stub standing in for a real classifier.

    routine      -> high-volume, behavior-bound, fine-tuned student wins
    domain_heavy -> needs RAG over the knowledge base
    novel        -> outside the training distribution, frontier + RAG
    """
    lowered = text.lower()  # match keywords case-insensitively
    if any(keyword in lowered for keyword in ("policy", "doc", "how do i")):
        return "domain_heavy"
    if len(text) < 80:
        return "routine"
    return "novel"


def pick_route(text: str) -> Route:
    kind = classify_request(text)
    if kind == "routine":
        # behavior-tuned student, no RAG cost on the latency-sensitive path
        return Route(model="claude-haiku-4-5", rag=False, use_finetune=True)
    if kind == "domain_heavy":
        # RAG carries the volatile knowledge; mid-tier model carries the synthesis
        return Route(model="claude-sonnet-4-6", rag=True, use_finetune=False)
    # novel -> escalate to the frontier with RAG safety net
    return Route(model="claude-opus-4-7", rag=True, use_finetune=False)


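def embed(query: str) -> list[float]:
    # Placeholder: the embedding call is elided in this skeleton. Plug in your
    # embedding model here; it must return a list[float] of length D.
    raise NotImplementedError("wire up an embedding model")


def retrieve(query: str, conn: psycopg.Connection, k: int = 6) -> list[str]:
    # Dense retrieval: embed the query, then ANN lookup by cosine distance (<=>).
    # A lexical fallback for hybrid retrieval is not shown here.
    query_vec = embed(query)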
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT chunk_text FROM doc_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (query_vec, k),
        )
        return [row[0] for row in cur.fetchall()]


def serve(text: str, conn: psycopg.Connection) -> str:
    route = pick_route(text)
    system_prompt = SYSTEM_PROMPTS["finetuned" if route.use_finetune else "base"]
    user_block = text
    if route.rag:
        chunks = retrieve(text, conn, k=6)
        ctx = "\n\n".join(f"<source>{c}</source>" for c in chunks)
        user_block = f"<context>\n{ctx}\n</context>\n\n<question>{text}</question>"

    resp = client.messages.create(
        model=route.model,  # if use_finetune, this would be the deployed variant ID
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_block}],
    )
    return resp.content[0].text


SYSTEM_PROMPTS = {
    # The fine-tuned student already encodes the brand voice and format,
    # so the system prompt collapses to a one-liner. The base model needs
    # the full behavioral demo set.
    "finetuned": "Answer concisely in the established support voice.",
    "base": (
        "You are a customer support assistant for ACME. "
        "Speak in second person, lead with the answer, "
        "cite sources from <source> tags inline, and refuse "
        "out-of-scope requests with: 'That's outside what I can help with.'"
    ),
}

The decision logic is the spine; everything else is plumbing. The pattern’s key invariant: the behavior contract (system prompt, output shape) shifts based on whether you’re calling the fine-tuned variant, but the knowledge contract (RAG context) shifts based on the request’s domain density. The two axes are independent and the harness lets them be set independently per request.