jatin.blog ~ $
$ cat ai-engineering/finetuning-vs-rag.md

Fine-Tuning vs RAG: When to Choose Which

A decision tree for fine-tuning vs RAG: what each tool actually changes, the cost model, where each fails, and why most 2026 production stacks ship both.

Jatin Bansal@blog:~/ai-engineering$ open finetuning-vs-rag

A B2B SaaS team is two months into a customer-facing assistant that grounds answers in the customer’s product documentation. The pilot works; the revenue case lands; production rolls out behind a Sonnet 4.6 API. Six weeks later finance flags the bill ($32k/month and rising linearly with seats) and asks the engineering lead the obvious question: can we fine-tune this so we stop paying frontier per-token rates? The engineering lead’s instinct is to say yes; routing the cheap 70% of queries to Haiku already saved 40%, and a fine-tuned smaller model could plausibly close the rest. The lead spends a week reading and comes back with a different answer: the cost problem is real, but fine-tuning the knowledge in would be fine-tuning the wrong layer. The documentation changes weekly. If we fine-tune the knowledge in, we’ll be retraining every Friday for the rest of the product’s life. The behavior (tone, format, refusal patterns, the way the assistant should structure follow-up questions) is what’s actually stable and worth baking in. We fine-tune the behavior, we keep the docs in retrieval. This is the framing the rest of the article makes precise.

Opening bridge

The last article in the Production & Operations subtree walked the case for routing requests between model tiers to cut cost without dropping quality. That’s the cheapest first lever, the one you should reach for before touching the model. Today’s piece walks the next lever in the same cost-quality arc: when routing isn’t enough, when the workload is uniform enough that a router can’t shed it, the choice becomes whether to change the model (fine-tuning) or change the prompt (retrieval). The reason these two are usually framed as competitors is that they superficially target the same complaint (“the base model doesn’t know my domain”), but they intervene at different layers and obey different cost laws. Routing told you which model to call; fine-tuning changes what that model has learned, and RAG changes what you hand it when you call it. The full Production & Operations subtree progression: serve cheaper (inference latency, speculative decoding) → call cheaper (model routing) → either fine-tune the model behind the call or augment its context (this article). The bridge to the next pieces in the subtree, guardrails and PII, is that fine-tuning and RAG are both also vectors for the safety story you’ll have to layer on top, but that’s the next mile of the road.

Definition

Fine-tuning is the process of updating a model’s weights on labeled task data to change its behavior; RAG is the process of supplying a model with relevant context at inference time to change its inputs. Fine-tuning is a write to the weights; RAG is a write to the prompt. The two are not interchangeable because they target structurally different surfaces — one is durable but coarse-grained, the other is ephemeral but precise. (For the training-side mechanics — the SFT and preference-optimization stages this article uses without unpacking — see From Pre-Training to RLHF and DPO and Modern Alignment.)

To make the contrast operationally usable, fix four properties along which they diverge.

| Property | Fine-tuning | RAG |
| --- | --- | --- |
| Where the knowledge lives | In model weights | In an external store retrieved at inference |
| When the cost is paid | At training (one-time, large) | At inference (per-query, small but additive) |
| Refresh cadence | Hours-to-days (retrain + redeploy) | Seconds-to-minutes (index update) |
| What it changes well | Style, format, refusal patterns, tool-call shape, output schema, narrow factual recall on stable corpora | Knowledge frontier, attribution, factual recency, multi-domain coverage |
| What it changes badly | Volatile knowledge, long-tail facts, attribution requirements | Style or behavioral consistency, latency-critical paths, format adherence under prompt-injection pressure |

This table is the article in miniature. Every section below either argues a row, supplies the cost math that justifies a row, or shows the production pattern that emerges when you accept the rows.

Intuition: form vs. facts

The clearest mental compression: fine-tuning is for form, RAG is for facts. Form is everything that’s stable about how your application should respond: the tone, the structure of the output, the JSON shape, the refusal language for queries outside scope, the chain-of-thought style if any, the brand voice. Facts are everything that could change next week: the docs, the prices, the customer’s product catalog, this quarter’s policies, the support knowledge base. Form gets fine-tuned; facts get retrieved. The reason the framing works so well is that it isolates the two failure modes you actually see in production: a base model that gets the facts right but says them in a way that doesn’t fit your product (a fine-tuning problem), and a fine-tuned model that says everything in the right voice but with month-old facts (a RAG problem). The fix in each case is the other tool.

Hamel Husain’s “Is Fine-Tuning Still Valuable?” phrases it as: “Fine-tuning works best to learn syntax, style and rules whereas techniques like RAG work best to supply the model with context or up-to-date facts.” This is the consensus framing across Anthropic’s documentation and the OpenAI fine-tuning guide. The framing fails only when you try to use one to do the other’s job — which is exactly the failure mode most teams run into when they pick fine-tuning to “teach the model our docs” and then discover their retraining cadence is now coupled to their content-publishing cadence.

A second, complementary frame: fine-tuning is a compiler optimization, RAG is a runtime lookup. The compiler optimization bakes a hot path into the binary — fast at runtime, but the binary has to be rebuilt when the hot path changes. The runtime lookup is slower per call but the lookup table is hot-swappable. Production systems use both, applied to different surfaces, for exactly this reason.

The distributed-systems parallel

Two parallels do work here and one disanalogy is worth being honest about.

Fine-tuning is the read-mostly materialized view; RAG is the indexed query. A database team materializes a view when the underlying query is expensive, the result rarely changes, and the consumers want low-latency reads — the cost is the maintenance overhead when the view goes stale, plus the storage of the precomputed result. The same shape: fine-tuning is expensive once, fast forever after, and maintenance costs spike when the underlying ground truth shifts. RAG is the equivalent of an indexed query against the live table — paying the join cost per request, but always seeing the current row. The decision criterion is the same as the database one: how often does the data underneath change relative to how often you read it? Stable behavior plus very high read rate → materialize (fine-tune). Volatile data plus tolerable per-query cost → indexed query (retrieve).

Fine-tuning is recompiling the binary; RAG is loading a shared library. When the behavior you want is structural and load-bearing across every code path, you compile it in: it’s cheap at runtime, it composes with the optimizer, and it can’t be tampered with at runtime. When the behavior is plugin-shaped and tenant-specific, you ship a shared library and load it at startup, which is slower than compiled-in, but you can swap libraries without rebuilding the binary. The fine-tune is the compiled-into-the-binary path; the retrieved context is the dynamically loaded plugin. The distinction also explains the prompt-cache story: a fine-tune fixes the model’s behavior in a way that survives every cache invalidation, while a retrieved chunk is fresh per request and necessarily defeats prefix caching past the retrieval boundary. (Cache placement strategy: keep the stable behavioral instructions in the system prompt at the start and put retrieved context at the end of the prompt, so the cacheable prefix doesn’t move.)
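The cache placement strategy can be sketched as a request builder. This is a minimal sketch, assuming the Anthropic Messages API's `cache_control` marker on system blocks; `build_request` is a hypothetical helper that only assembles the request arguments and does not call the API.

```python
from typing import Any


def build_request(system_prompt: str, chunks: list[str], question: str,
                  model: str = "claude-haiku-4-5") -> dict[str, Any]:
    """Assemble Messages API kwargs with the stable prefix first.

    The system prompt is marked cacheable; retrieved chunks go last, in the
    user turn, so per-request context never shifts the cached prefix.
    """
    context = "\n\n".join(f"<source>{c}</source>" for c in chunks)
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Cache breakpoint: everything up to here is the reusable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": (
                    f"<context>\n{context}\n</context>\n\n"
                    f"<question>{question}</question>"
                ),
            }
        ],
    }


# Two requests with different retrieved chunks share an identical system prefix.
req_a = build_request("Answer in the support voice.", ["chunk one"], "How do I reset?")
req_b = build_request("Answer in the support voice.", ["other chunk"], "What changed?")
print(req_a["system"] == req_b["system"])
```

In practice the provider only caches prefixes above a minimum token length, so a one-line system prompt like the one here would not be cached; the sketch shows ordering, not savings.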

The honest disanalogy. Database materialization is reversible — you can drop the view, fall back to the live query, and you’ve lost nothing but performance. Fine-tuning is not reversible in the same clean way. Once you’ve fine-tuned a model, that artifact has the behavior baked in; reverting to the base model means losing not just the unwanted behavior but also the wanted behavior the fine-tune installed correctly. This is why production fine-tuning runs always preserve the base model as a fallback target, ship adapter weights (LoRA/QLoRA) rather than full-weight forks where the framework supports it, and treat the adapter set as a first-class deployment artifact with its own version, eval suite, and rollback story. The “merge or don’t merge” decision on LoRA adapters is the AI-engineering equivalent of “monkey-patch or fork” — usually you want the adapter to remain swappable.

When fine-tuning is the right tool

Pull the criteria together and the affirmative case for fine-tuning has four shapes.

1. Stable behavior that prompting can’t reliably enforce. You have a target output format, voice, refusal pattern, or chain-of-thought style that you can describe in a rubric, that an evaluator can grade reliably, and that prompting alone — even with strong examples — produces only ~80% adherence on. The classic case is a strict JSON schema with deeply nested optional fields, a customer-service voice that has to thread a specific brand register across long answers, or a refusal policy that has to be uniform across many phrasings of out-of-scope queries. A few thousand labeled examples fine-tune the behavior to 99%+, and the prompt no longer needs the multi-shot demonstration block — which incidentally cuts input tokens by 60–80% per request, paying back the fine-tune cost in inference savings within weeks at high volume.

2. Latency-sensitive paths where retrieval is too expensive. The retrieval round-trip — embed-the-query, ANN-lookup, hydrate-the-chunks, append-to-prompt — is typically 50–200ms on a tight vector store and more if you have a reranker in the cascade. For sub-100ms response paths (voice assistants, real-time UI affordances, latency-critical user flows), the retrieval cost is the wrong primary lever; fine-tuning the model to absorb the narrow knowledge it needs for that specific path avoids the lookup entirely. The narrowness matters — this only works when the knowledge surface for the latency-sensitive path is small enough that the model can absorb it, which means a few hundred entities, not the full corpus.

3. High volume at the cost frontier. A workload that ships 100M+ requests/month is the right shape for fine-tuning a smaller open-weight model (a Llama 3.3 or a Qwen 2.5) and self-hosting on dedicated GPUs. The break-even is usually around $200k/year of API spend, the point where the savings over per-token API pricing start to cover the cost of operating your own model-serving infrastructure. The fine-tune is part of the migration plan: you can’t just swap in the smaller model cold; it needs to be fine-tuned on the workload’s task distribution to close the quality gap against the frontier model it’s replacing.

4. Distillation from a frontier teacher to a deployable student. Generate the training corpus by running a small share of your traffic through the frontier model (Sonnet 4.6, GPT-5.5, Opus 4.7), capturing input-output pairs, and fine-tuning your deployable target (Haiku 4.5, GPT-5.4-mini, or an open-weight model) on that corpus. This is the production pattern most teams converge on once the data flywheel is running: the frontier model is the labeler, the fine-tuned cheaper model is the deployment target, and the LLM-as-judge infrastructure validates that the student hasn’t drifted off the teacher’s quality envelope on the workload’s evals. Distillation as fine-tuning is structurally different from fine-tuning for behavior change (the goal is capability transfer, not behavior installation), but the mechanics are the same.
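The capture step in shape 4 can be sketched as a sampler that sends a share of traffic to the teacher and stores the pairs. This is a minimal sketch: `capture_pairs`, `to_jsonl`, and `fake_teacher` are hypothetical names, and `fake_teacher` stands in for a real frontier-model call.

```python
import json
import random
from typing import Callable, Iterable


def capture_pairs(requests: Iterable[str],
                  teacher: Callable[[str], str],
                  sample_rate: float,
                  seed: int = 0) -> list[dict[str, str]]:
    """Send a random share of requests to the teacher and keep the pairs."""
    rng = random.Random(seed)
    pairs: list[dict[str, str]] = []
    for prompt in requests:
        if rng.random() < sample_rate:
            pairs.append({"input": prompt, "output": teacher(prompt)})
    return pairs


def to_jsonl(pairs: list[dict[str, str]]) -> str:
    """Serialize one JSON object per line, a common fine-tuning input layout."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


def fake_teacher(prompt: str) -> str:
    # Stand-in for a frontier-model call; replace with a real client.
    return f"ANSWER: {prompt}"


traffic = [f"question {i}" for i in range(1000)]
pairs = capture_pairs(traffic, fake_teacher, sample_rate=0.05)
print(len(pairs), "pairs captured")
```

The captured pairs would normally pass through deduplication and a judge-based quality filter before training; that step is left out here.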

Outside these four shapes, fine-tuning is usually the wrong tool, even when the problem feels like it should be a fine-tuning problem.
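The payback claim in shape 1 (dropping the few-shot block pays for the fine-tune) is easy to check with back-of-envelope arithmetic. The sketch below is illustrative only: every number in the example call is a hypothetical workload, and it assumes the fine-tuned model is billed at the same per-token input price as the model it replaces.

```python
def payback_weeks(training_cost_usd: float,
                  tokens_saved_per_request: int,
                  requests_per_week: int,
                  input_price_per_mtok: float) -> float:
    """Weeks until input-token savings cover the one-time fine-tune cost."""
    weekly_savings = (tokens_saved_per_request * requests_per_week
                      * input_price_per_mtok / 1_000_000)
    if weekly_savings <= 0:
        return float("inf")
    return training_cost_usd / weekly_savings


# Hypothetical workload: a 1,500-token few-shot block removed from 2M
# requests per week at $1 per million input tokens, against a $5,000
# all-in fine-tune cost (training run plus data work).
weeks = payback_weeks(5_000, 1_500, 2_000_000, 1.0)
print(f"payback in {weeks:.1f} weeks")
```

At low volume the same formula returns years rather than weeks, which is the quantitative version of "at high volume" in the text.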

When fine-tuning will silently fail you

The knowledge-injection trap is the most important failure mode to internalize. Two pieces of recent research land the same conclusion from different angles.

Mecklenburg et al. (2024), “Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning” compared supervised fine-tuning to RAG on a controlled benchmark of post-training facts. Document-format training data (raw paragraphs) produced near-zero retention. QA-format training data — synthetic question-answer pairs over the same facts — performed substantially better but still underperformed RAG by a clear margin. The mechanism: the gradient signal in document training is dominated by language modeling loss across the whole text, not by the specific entity-relation-entity tuple you want stored; QA format concentrates the signal on the answer span, which works better but still falls short of explicit retrieval.
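To make the two training-data shapes concrete, here is a minimal sketch that renders the same facts as a document paragraph and as question-answer pairs. The `Fact` type and the string templates are assumptions for illustration; the paper's QA data was synthetic, and its exact generation method is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    entity: str
    attribute: str
    value: str


def as_document(facts: list[Fact]) -> str:
    """Document format: facts embedded in running text."""
    return " ".join(f"The {f.attribute} of {f.entity} is {f.value}." for f in facts)


def as_qa_pairs(facts: list[Fact]) -> list[dict[str, str]]:
    """QA format: the completion is only the answer span."""
    return [
        {"prompt": f"What is the {f.attribute} of {f.entity}?", "completion": f.value}
        for f in facts
    ]


facts = [
    Fact("the Pro plan", "seat limit", "25"),
    Fact("the Team plan", "seat limit", "10"),
]
print(as_document(facts))
print(as_qa_pairs(facts)[0])
```

In the document form the loss is spread over every token of the paragraph; in the QA form the supervised completion is just the value, which is the concentration of signal the paragraph above describes.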

Soudani, Kanoulas, and Hasibi (2024), “Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge” made the long-tail story sharper: RAG outperforms supervised fine-tuning by a large margin for less-popular entities, and the gap widens as popularity decreases. Fine-tuning compresses the frequent patterns well and forgets the long tail; retrieval treats every entity equivalently and pulls the right tuple regardless of training-corpus frequency. The honest reading: if your application’s value is in handling the long tail accurately, fine-tuning is the wrong tool.

Companion work in the same vein found that unsupervised fine-tuning on documents (the cheapest fine-tuning shape) provided “only limited gains over base models” — barely outperforming the un-tuned model on factual recall. Supervised fine-tuning on task-specific data did better, but the conclusion remained: for inserting new knowledge into a model, RAG dominates the win-rate across model sizes and knowledge popularity bands.

Three more failure modes are worth naming explicitly.

The catastrophic-forgetting tax. A naively fine-tuned model loses general capability outside the fine-tuning distribution. The mitigation is LoRA/QLoRA adapters: low-rank updates that touch only a small fraction of the parameters, so the base model’s behavior outside the adapter’s task is largely preserved. The cost: adapter merging at inference, rank selection (typically 8–64 for behavior tasks), and a per-task adapter artifact you have to deploy alongside the base. Full-weight fine-tuning is reserved for cases where the behavior is so pervasive it justifies a full model fork; for most production behavior tweaks, LoRA is the default.
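The "small fraction of the parameters" claim follows from the shape of a low-rank update: for a weight matrix of size d_out × d_in, a rank-r adapter adds two factors with r × (d_in + d_out) parameters in total. A short sketch of that arithmetic, where the 4096 × 4096 layer size is an illustrative assumption rather than a figure from any specific model:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter parameters as a fraction of the full weight matrix they modify."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)


# Sweep the rank range mentioned above on a hypothetical 4096 x 4096 projection.
for rank in (8, 16, 64):
    frac = lora_param_fraction(4096, 4096, rank)
    print(f"rank {rank}: {frac:.2%} of the layer's parameters")
```

The fraction for a whole model is lower still, since adapters are usually attached to only some of the weight matrices.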

The eval-gap surprise. A team fine-tunes, sees 5–10% improvement on the offline eval, ships it, and watches production quality drop on slices the eval didn’t cover. The fine-tune optimized for the eval distribution and silently regressed on slices outside it. The mitigation is the eval-driven-development discipline: a properly stratified golden set with explicit coverage of the long-tail slices the fine-tune might harm, and a drift detector that catches the regression in production if the eval missed it. Fine-tuning without strong evals is sniping at the eval metric, not improving the workload.
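A per-slice comparison is one way to catch this before shipping. The sketch below is a minimal version under stated assumptions: slice names and scores are invented, and `regressed_slices` is a hypothetical helper that flags any slice where the tuned model scores more than a tolerance below the base model.

```python
def regressed_slices(base: dict[str, float],
                     tuned: dict[str, float],
                     tolerance: float = 0.01) -> list[str]:
    """Slices where the tuned model is worse than base by more than `tolerance`.

    A slice missing from `tuned` is treated as a score of 0.0, so an
    unevaluated slice is flagged rather than silently skipped.
    """
    return sorted(s for s in base if tuned.get(s, 0.0) < base[s] - tolerance)


# Hypothetical golden-set scores, stratified by slice.
base_scores = {"head": 0.82, "long_tail": 0.74, "out_of_scope": 0.90}
tuned_scores = {"head": 0.91, "long_tail": 0.66, "out_of_scope": 0.90}

print(regressed_slices(base_scores, tuned_scores))
```

Gating the deploy on an empty result is stricter than gating on the aggregate metric, which a large head slice can pull upward while the long tail regresses.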

The lifecycle-ownership debt. The fine-tune is a piece of code: it has a version, an owner, a CI pipeline, a deploy story, a rollback story, an eval suite, a re-train cadence, and a deprecation plan. Most teams that fine-tune in production are surprised by how much of this they have to build. Managed platforms such as the Hugging Face Hub and Amazon Bedrock hide the build/serve plumbing; they don’t hide the ops cost of maintaining the artifact over a multi-year horizon. The honest cost model for a fine-tune includes the engineer-month of lifecycle work, not just the dollar cost of the training run.

The cost model that decides

The numbers are the most legible part of this article. Let me lay them out concretely as of May 2026.

Fine-tuning costs.

  • Anthropic Claude Haiku fine-tuning on Bedrock. Fine-tuning is available for Claude 3 Haiku only, and only in us-west-2; the GA announcement is from November 2024 and remains the canonical entry point. Claude Haiku 4.5 is the current Bedrock model, with fine-tuning availability tracking behind general availability. Training cost: usage-based, roughly $50–$300 per fine-tuning run on a typical 10k–50k example dataset. Inference: pay the per-token price on top of a monthly “custom model” hosting fee.
  • OpenAI fine-tuning. The platform is winding down as of May 2026 — closed to new users, with existing users able to fine-tune GPT-4.1 and GPT-4.1-mini (SFT/DPO) and o4-mini (reinforcement fine-tuning, $100/hour of training time). The migration path published in OpenAI’s deprecation notes is to the agentic/Responses API stack rather than a successor fine-tuning product. If your roadmap depends on OpenAI fine-tuning, treat it as a sunsetting capability and plan accordingly.
  • Open-weight LoRA on your own GPUs. A LoRA of a 7B–13B base on 10k–50k examples runs $50–$500 per training run on a rented H100, depending on rank and epochs. QLoRA (4-bit quantized base, LoRA adapters in fp16) cuts the GPU requirement enough that a single 24GB GPU can fine-tune a 13B model, making it the right starter shape for early-stage exploration.
  • The hidden cost. Data curation. Across every production team that has shipped a fine-tune, the consistent report is that the engineer-time spent building the training corpus — sourcing examples, labeling, deduplicating, validating with judge runs, iterating with eval feedback — is 5–10× the dollar cost of the training run. Budget the data work at one engineer-month for a serious fine-tune.

RAG costs.

  • Embedding model. OpenAI’s text-embedding-3-large is $0.13/M tokens; Cohere embed-v4 and Voyage’s voyage-3 sit in similar ranges. For a 1M-document corpus at 1k tokens/doc (1B tokens in total) that’s $130 to embed once, plus deltas on updates.
  • Vector store. pgvector on a managed Postgres is the cheap path — included in your DB cost for small corpora, scales to ~10M vectors before HNSW build times start to matter. Managed services (Pinecone, Weaviate, Qdrant) charge a per-pod or per-vector hosting cost, typically $50–$500/month for production workloads in the 1M–10M vector range.
  • Per-query cost. Embed-the-query is well under $0.0001 per request at typical query lengths (a 100-token query at $0.13/M tokens is about $0.00001); ANN lookup is sub-cent per query on managed stores; the extra context the retrieved chunks add to the prompt is the meaningful per-query line, typically 1k–4k extra input tokens per RAG-augmented call, which at Haiku 4.5 prices ($1/M input) is $0.001–$0.004 per request.
  • The hidden cost. Chunking, indexing, hybrid-search tuning, reranking, evaluation. The retrieval system is non-trivial; building it well is also a meaningful engineering investment, just one that’s split across the inference path rather than concentrated at training.
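The RAG line items above reduce to two multiplications, sketched here with the prices quoted in the list. The function names are hypothetical, and the 10M requests/month volume in the last line is an assumed example, not a figure from the article.

```python
def embed_corpus_cost(num_docs: int, tokens_per_doc: int, price_per_mtok: float) -> float:
    """One-time cost to embed a corpus, in dollars."""
    return num_docs * tokens_per_doc / 1_000_000 * price_per_mtok


def context_cost_per_request(extra_input_tokens: int, input_price_per_mtok: float) -> float:
    """Added input-token cost of retrieved context on one call, in dollars."""
    return extra_input_tokens / 1_000_000 * input_price_per_mtok


corpus = embed_corpus_cost(1_000_000, 1_000, 0.13)   # 1M docs at 1k tokens each
low = context_cost_per_request(1_000, 1.0)           # 1k extra tokens at $1/M input
high = context_cost_per_request(4_000, 1.0)          # 4k extra tokens at $1/M input

print(f"embed corpus once: ${corpus:,.2f}")
print(f"context per request: ${low:.4f} to ${high:.4f}")
print(f"context per month at 10M requests: ${low * 10_000_000:,.0f} to ${high * 10_000_000:,.0f}")
```

The one-time embedding cost is small next to the recurring context cost, which is why the per-query line is the one to watch as traffic grows.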

The refresh-cost ratio. This is the number that does the most work in the decision. Per refresh, a fine-tune update costs roughly 10× to 100× what a RAG update costs. A RAG update is “re-embed the changed chunks, write to the index”: minutes of automated work. A fine-tune update is “regenerate the training data with the new facts, retrain, re-eval, redeploy”: hours-to-days of engineer time. If your knowledge changes weekly, you pay that premium 52 times a year. The break-even calculation: fine-tuning wins on lifecycle cost only when the knowledge is stable enough that its total refresh cost over a period falls below the RAG refresh cost plus RAG’s per-query overhead multiplied by your traffic volume over the same period. For most workloads with any knowledge volatility, this never crosses.
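The break-even comparison can be written out as a sketch. All dollar figures in the example are hypothetical inputs chosen to sit inside the ranges above (a 100× per-refresh ratio, $0.002 of retrieval overhead per request), and the model deliberately ignores quality differences between the two approaches.

```python
def annual_costs(refreshes_per_year: int,
                 ft_cost_per_refresh: float,
                 rag_cost_per_refresh: float,
                 requests_per_year: int,
                 rag_overhead_per_request: float) -> tuple[float, float]:
    """Return (fine-tune lifecycle cost, RAG lifecycle cost) in dollars per year."""
    fine_tune = refreshes_per_year * ft_cost_per_refresh
    rag = (refreshes_per_year * rag_cost_per_refresh
           + requests_per_year * rag_overhead_per_request)
    return fine_tune, rag


# Hypothetical workload: 12M requests/year, $2,000 per fine-tune refresh
# (engineer time plus training run), $20 per RAG refresh.
weekly = annual_costs(52, 2_000.0, 20.0, 12_000_000, 0.002)
quarterly = annual_costs(4, 2_000.0, 20.0, 12_000_000, 0.002)

for label, (ft, rag) in (("weekly", weekly), ("quarterly", quarterly)):
    winner = "fine-tune" if ft < rag else "RAG"
    print(f"{label} refresh: fine-tune ${ft:,.0f} vs RAG ${rag:,.0f} -> {winner}")
```

With these inputs the weekly cadence favors RAG and the quarterly cadence favors fine-tuning; the crossover moves with traffic volume, since RAG's per-request overhead scales with requests while refresh cost does not.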

The decision tree

The 2026 production default isn’t “fine-tune or RAG” — it’s a sequence that escalates in cost and complexity, with each step justified by an explicit failure of the previous one.

text
1. PROMPT ENGINEERING  →  Try first. Cheapest, fastest, most reversible.
   Failure mode that justifies escalation:
   - Quality plateau on evals you trust, despite system-prompt iteration,
     few-shot examples, and structured output coercion.
   - Token cost on the few-shot demonstration block becomes meaningful at scale.

2. RAG                  →  Add when the bottleneck is "the model doesn't know
                           our facts" or "the facts change too fast to bake in."
   Failure mode that justifies escalation:
   - Eval shows model still doesn't follow the application's format or voice,
     even with retrieved context — i.e. the gap is behavioral, not factual.
   - Retrieval latency dominates the latency budget on a path where it can't.
   - You've reached the long-tail floor where retrieval over your corpus is
     no longer recovering the precision you need.

3. FINE-TUNE (LoRA)     →  Add when the bottleneck is "the model doesn't behave
                           the way our application requires" or "we need
                           cheaper deployable artifacts for high-volume paths."
   Failure mode that justifies escalation:
   - Per-token cost on the frontier model is unsustainable and you've already
     routed and cached aggressively.
   - You need a self-hosted artifact for compliance, data-residency, or
     latency-floor reasons.

4. DISTILL              →  Add when you have a high-volume workload, a quality
                           bar set by a frontier model, and an eval suite that
                           can verify the student tracks the teacher.

This is essentially the order that Hamel Husain recommends: do as much prompt engineering as possible before you fine-tune, because the prompt is the cheapest tool and prompting well is also a great stress test of your eval suite. The 2026 consensus that ScalaCode, BigDataBoutique, and other recent guides converge on is also this sequence: prompt → RAG → fine-tune → distill. The order matters because each step is more reversible than the next, and each step’s right-or-wrong answer is detectable on an eval suite that’s been refined by all the previous steps.
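One way to read the escalation ladder is as a function from observed failure signals to the stack you should be running. The sketch below is one interpretation, not a definitive encoding: the `Signals` fields are hypothetical names for the failure modes listed in the tree, and each would be set from eval results in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    prompt_plateaued: bool            # quality stuck despite prompt iteration
    facts_missing_or_volatile: bool   # model lacks facts, or facts change fast
    behavior_gap_remains: bool        # format or voice still off with context
    frontier_cost_unsustainable: bool # after routing and caching
    evals_can_verify_student: bool    # eval suite tracks student vs teacher


def recommended_stack(s: Signals) -> list[str]:
    """Escalate only as far as the observed failures justify."""
    stack = ["prompt"]
    if not s.prompt_plateaued:
        return stack
    if s.facts_missing_or_volatile:
        stack.append("rag")
    if s.behavior_gap_remains:
        stack.append("fine-tune")
    if s.frontier_cost_unsustainable and s.evals_can_verify_student:
        stack.append("distill")
    return stack


early = Signals(False, True, True, True, True)
mature = Signals(True, True, True, False, False)
print(recommended_stack(early))
print(recommended_stack(mature))
```

The early return is the point of the ladder: until prompting has plateaued on evals you trust, the other signals are not yet reliable enough to act on.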

The hybrid pattern: fine-tune behavior, retrieve facts

The production pattern that wins in 2026 is the hybrid one. The fine-tune installs the behavior, the retriever supplies the facts, the eval suite enforces both. Three concrete shapes.

Pattern A: Behavior-only LoRA + full RAG. Fine-tune a small open-weight model (Llama 3.3 70B, Qwen 2.5 32B) with a behavior-only LoRA — output format, voice, refusal language, tool-use shape. The training corpus is small, ~5k–20k examples generated from a frontier teacher (Sonnet 4.6) and validated by an LLM judge. At inference, retrieve from the live RAG corpus, build the prompt, call the fine-tuned model. The fine-tune is stable across knowledge updates because it doesn’t encode any knowledge; the RAG handles the volatile half cleanly.

Pattern B: Distilled student + thin behavioral fine-tune. For a high-volume workload where economics demand a cheap model. Run a sample of traffic through the frontier teacher, capture (input, output) pairs, fine-tune the student (Haiku-class or open-weight) on the corpus. The student picks up both the behavior and a meaningful chunk of the teacher’s reasoning quality on the workload’s distribution. Layer RAG on top to keep the knowledge surface live. The hardest engineering problem here is the eval — proving that the student is “good enough” requires a judge or rubric that can measure the workload’s specific quality dimension, which is the LLM-as-judge and eval-driven-development story applied in production.

Pattern C: Frontier model, no fine-tune. The default for early-stage products. Prompt engineering plus RAG plus prompt caching plus model routing usually gets you roughly 80% of the way to where fine-tuning would land, at a tenth of the engineering cost. The honest test: if you can’t articulate which specific behavior the fine-tune is going to change, you’re not ready to fine-tune.

Code: a hand-rolled decision harness in Python

The skeleton below implements the decision logic at the boundary between the application and the model: per request, a workload classifier picks the fine-tuned student without RAG, a mid-tier model with RAG, or the frontier model with RAG. The pattern composes with the model-routing primitive from the previous article. Dependencies: the Anthropic SDK, psycopg, and a Postgres database with the pgvector extension.

python
# pip install anthropic psycopg[binary] pgvector
import os
from dataclasses import dataclass
from typing import Literal

from anthropic import Anthropic
import psycopg

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

@dataclass
class Route:
    model: str               # the deployable model for this request
    rag: bool                # whether to retrieve context
    use_finetune: bool       # whether to call the fine-tuned variant


def classify_request(text: str) -> Literal["routine", "domain_heavy", "novel"]:
    """A real classifier; here a stub.

    routine     -> high-volume, behavior-bound, fine-tuned student wins
    domain_heavy -> needs RAG over the knowledge base
    novel        -> outside the training distribution, frontier + RAG
    """
    if "policy" in text or "doc" in text or "how do I" in text:
        return "domain_heavy"
    if len(text) < 80:
        return "routine"
    return "novel"


def pick_route(text: str) -> Route:
    kind = classify_request(text)
    if kind == "routine":
        # behavior-tuned student, no RAG cost on the latency-sensitive path
        return Route(model="claude-haiku-4-5", rag=False, use_finetune=True)
    if kind == "domain_heavy":
        # RAG carries the volatile knowledge; mid-tier model carries the synthesis
        return Route(model="claude-sonnet-4-6", rag=True, use_finetune=False)
    # novel -> escalate to the frontier with RAG safety net
    return Route(model="claude-opus-4-7", rag=True, use_finetune=False)


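def embed(query: str) -> list[float]:
    """Placeholder: the embedding call is elided here. Plug in your embedding model."""
    raise NotImplementedError("embed() must call an embedding model")


def retrieve(query: str, conn: psycopg.Connection, k: int = 6) -> list[str]:
    # Dense retrieval only: embed the query, then a nearest-neighbor lookup by
    # pgvector cosine distance (<=>). A lexical fallback is not implemented here.
    query_vec = embed(query)  # list[float] of length D
    # Send the vector as a pgvector text literal, e.g. "[0.1,0.2,...]".
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT chunk_text FROM doc_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec_literal, k),
        )
        return [row[0] for row in cur.fetchall()]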


def serve(text: str, conn: psycopg.Connection) -> str:
    route = pick_route(text)
    system_prompt = SYSTEM_PROMPTS["finetuned" if route.use_finetune else "base"]
    user_block = text
    if route.rag:
        chunks = retrieve(text, conn, k=6)
        ctx = "\n\n".join(f"<source>{c}</source>" for c in chunks)
        user_block = f"<context>\n{ctx}\n</context>\n\n<question>{text}</question>"

    resp = client.messages.create(
        model=route.model,  # if use_finetune, this would be the deployed variant ID
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_block}],
    )
    return resp.content[0].text


SYSTEM_PROMPTS = {
    # The fine-tuned student already encodes the brand voice and format,
    # so the system prompt collapses to a one-liner. The base model needs
    # the full behavioral demo set.
    "finetuned": "Answer concisely in the established support voice.",
    "base": (
        "You are a customer support assistant for ACME. "
        "Speak in second person, lead with the answer, "
        "cite sources from <source> tags inline, and refuse "
        "out-of-scope requests with: 'That's outside what I can help with.'"
    ),
}

The decision logic is the spine; everything else is plumbing. The pattern’s key invariant: the behavior contract (system prompt, output shape) shifts based on whether you’re calling the fine-tuned variant, but the knowledge contract (RAG context) shifts based on the request’s domain density. The two axes are independent and the harness lets them be set independently per request.

Code: the TypeScript shape with the Vercel AI SDK

Functionally equivalent: same decision logic, same RAG step, idiomatic Vercel AI SDK usage. Dependencies: @ai-sdk/anthropic, ai, and a Postgres client (postgres is used here; pg also works).

typescript
// npm i @ai-sdk/anthropic ai postgres
import { anthropic } from "@ai-sdk/anthropic";
import { generateText } from "ai";
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

type Kind = "routine" | "domain_heavy" | "novel";

function classify(text: string): Kind {
  if (text.includes("policy") || text.includes("doc") || text.includes("how do I"))
    return "domain_heavy";
  if (text.length < 80) return "routine";
  return "novel";
}

type Route = { model: string; rag: boolean; useFinetune: boolean };

function pickRoute(text: string): Route {
  const kind = classify(text);
  switch (kind) {
    case "routine":
      return { model: "claude-haiku-4-5", rag: false, useFinetune: true };
    case "domain_heavy":
      return { model: "claude-sonnet-4-6", rag: true, useFinetune: false };
    case "novel":
      return { model: "claude-opus-4-7", rag: true, useFinetune: false };
  }
}

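// Placeholder: the embedding call is elided here. Plug in your embedding model.
async function embed(_query: string): Promise<number[]> {
  throw new Error("embed() must call an embedding model");
}

async function retrieve(query: string, k = 6): Promise<string[]> {
  const queryVec = await embed(query); // number[] of length D
  // Pass the vector as a pgvector text literal and order by cosine distance (<=>).
  const rows = await sql<{ chunk_text: string }[]>`
    SELECT chunk_text FROM doc_chunks
    ORDER BY embedding <=> ${`[${queryVec.join(",")}]`}::vector
    LIMIT ${k}
  `;
  return rows.map((r) => r.chunk_text);
}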

const SYSTEM = {
  finetuned: "Answer concisely in the established support voice.",
  base:
    "You are a customer support assistant for ACME. " +
    "Speak in second person, lead with the answer, cite sources from <source> tags inline, " +
    "and refuse out-of-scope requests with: 'That's outside what I can help with.'",
} as const;

export async function serve(text: string): Promise<string> {
  const route = pickRoute(text);
  let userBlock = text;
  if (route.rag) {
    const chunks = await retrieve(text);
    const ctx = chunks.map((c) => `<source>${c}</source>`).join("\n\n");
    userBlock = `<context>\n${ctx}\n</context>\n\n<question>${text}</question>`;
  }
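  const { text: out } = await generateText({
    // If useFinetune is set, this would be the deployed variant's model ID.
    model: anthropic(route.model),
    system: route.useFinetune ? SYSTEM.finetuned : SYSTEM.base,
    prompt: userBlock,
    maxTokens: 1024, // renamed to maxOutputTokens in AI SDK 5
  });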
  return out;
}

The same invariants hold. The decision lives in pickRoute; the behavior contract lives in SYSTEM; the knowledge contract lives in retrieve. The harness composes with streaming, structured output, and tool use — none of which require changing the decision logic.

Trade-offs, failure modes, gotchas

The “we’ll fine-tune to learn our docs” trap. This is the most common mistake. The team’s instinct is that fine-tuning teaches the model their domain knowledge, the way training a junior employee teaches them the playbook. The mechanism is closer to “the model partially memorizes phrasings from the training corpus, with strong recency and frequency biases, and forgets the long tail.” For documentation, internal knowledge bases, support content, product catalogs, and anything else that’s content-shaped, RAG is the right primitive. Fine-tuning on top of RAG can sharpen how the model handles retrieved context, but the knowledge itself belongs in the retriever.

The format-versus-content confusion. A team fine-tunes a model on (query, answer) pairs and discovers the model now produces answers in the format of the training data even on queries that should produce different formats. The fine-tune has installed format as a feature instead of as a contingent property. The mitigation is to vary the format in the training data deliberately when format is conditional on the query, or to install format separately through a structured-output schema that the fine-tune doesn’t override.

Prompt-cache invalidation. A fine-tuned model is its own model ID for caching purposes. Switching between the fine-tuned variant and the base model — even within the same workload — busts the prompt cache. Production routers that route between fine-tune and base are paying the cache-miss tax on every transition; the prompt-caching economics that worked for routing between Haiku and Sonnet don’t compose the same way across base and fine-tune. The mitigation is to not route across the fine-tune boundary on the same conversation — pick one for the session, stick with it.
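The "pick one for the session" mitigation can be sketched as a router that pins a model choice on the first request of a session and reuses it afterwards. `StickyRouter` and the model IDs below are hypothetical, and the in-memory dict is an assumption; a real deployment would need shared storage and expiry.

```python
from typing import Callable


class StickyRouter:
    """Pin the model chosen on a session's first request for all later turns."""

    def __init__(self, choose: Callable[[str], str]) -> None:
        self._choose = choose
        self._pinned: dict[str, str] = {}

    def model_for(self, session_id: str, text: str) -> str:
        # Only the first turn of a session consults the per-request chooser,
        # so later turns never cross the fine-tune/base cache boundary.
        if session_id not in self._pinned:
            self._pinned[session_id] = self._choose(text)
        return self._pinned[session_id]


def choose_model(text: str) -> str:
    # Stand-in policy: short requests go to the fine-tuned student.
    return "ft-student" if len(text) < 80 else "base-frontier"


router = StickyRouter(choose_model)
first = router.model_for("session-1", "reset my password")
later = router.model_for("session-1", "x" * 200)  # long turn, same session
other = router.model_for("session-2", "x" * 200)  # long turn, new session
print(first, later, other)
```

The trade-off is that a session pinned to the student stays there even if a later turn would have been routed to the frontier model, so the first-turn classifier carries more weight.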

The RAG-evals-cover-up. A team adds RAG on top of an underperforming model, the retrieval metrics improve, the evals get better, and the team concludes “RAG fixed our quality problem.” Sometimes it has. Sometimes it has hidden the quality problem by funneling the model into the small slice of the input distribution where the retrieved context does most of the work. The check is to evaluate the application end-to-end on a stratified slice that includes queries outside the corpus’s coverage — if quality collapses on those, RAG is masking the actual problem and you need to look at the model again.

The fine-tune-as-IP claim. “Our fine-tuned model is our IP” is true in a marketing sense and overstated in an engineering sense. A LoRA adapter trained on 5k examples of customer-support voice is replicable in a week by a competitor with comparable data hygiene. The actual IP is the training corpus — the curated, judge-validated, eval-aligned dataset — and the eval suite that proves the model meets the bar. The model artifact is the cheapest part of the stack to reproduce.

Drift between fine-tune and base. As Anthropic and OpenAI release new base versions, your fine-tune is increasingly behind the frontier. The decision to migrate the fine-tune to a newer base is a real engineering project — re-running training, re-evaluating, re-deploying — and it has to be scheduled deliberately. Teams that don’t schedule it end up running fine-tunes against models two generations old, paying for capability they could get cheaper from the current base.

The data-loop discipline. Both fine-tuning and RAG benefit enormously from production human-in-the-loop feedback — the data flywheel turns user thumbs-down and edits into either new training pairs (fine-tuning) or new chunks/queries to add to the corpus (RAG). Teams that ship without the feedback loop in place flat-line on quality regardless of which tool they picked. The feedback infrastructure is upstream of the fine-tune-vs-RAG choice.

Further reading from the field