
Summarization and Context Compression

Context compression for LLM agents: recursive summarization, structured note-taking, measuring quality loss, and the log-compaction parallel.

Jatin Bansal@blog:~/ai-engineering$ open context-compression

A coding agent has been working on the same refactor for three hours. The conversation buffer is at 180,000 tokens: sixty tool calls, forty file reads, twenty edit attempts, the original task brief, and the user’s incremental clarifications scattered across the timeline. The harness fires its compaction pass at the watermark, drops the buffer to 30,000 tokens of summary, and continues. Five turns later the agent calls Read on src/auth/handler.ts, reads the file fresh, and announces it will “now implement the change requested earlier.” The user, watching, types: you already implemented that change two hours ago, in the commit you summarized as “auth refactor.” The agent has no record of the commit. The summary said “implemented auth refactor”; the file path is gone, the diff is gone, the decision rationale is gone. The conversation lost its grip on its own history at the moment it tried to keep it. This is the failure mode context compression exists to manage, and getting it wrong silently substitutes amnesia plus confabulation for honest truncation. This article is the deep dive on the compression layer.

Opening bridge

Yesterday’s piece covered reflection — the write-time-deferred operation that reads a window of episodes and emits higher-order beliefs. Reflection generalizes; today’s operation compresses. The two are siblings on the maintenance axis but they answer different questions: reflection asks “what pattern do these episodes reveal?” and writes a new fact to the belief store; compression asks “how can I represent this span of conversation in fewer bytes without losing the load-bearing parts?” and writes a replacement for the original span. The short-term memory article named compaction as the eviction policy that drops from the middle rather than from the head or the tail; the write policies article named distillation as the per-turn shape transform that turns raw content into structured facts. Compression is the operation that handles the gap between them — what happens to a span of conversation that has already been written and admitted, but no longer fits in the working set the next call has to serve. It is the log-compaction layer of the memory subsystem.

Definition

Context compression is the operation that reads a span of conversation history or memory entries and replaces it with a shorter representation — a summary, a structured note, a compressed token stream, or a curated subset — that the next model call uses in place of the original. Three properties distinguish compression from its neighbors. First, it is replacement, not addition: the compressed output occupies the slot the source span used to occupy in the message array, the source is dropped from the live working set, and the replacement is what the model sees from then on. Second, it is lossy: every compression strategy throws away information; the design question is which information survives and which doesn’t. Third, it is triggered by budget pressure, not by content type: the compression pass fires because the buffer is approaching the context limit, the cost ceiling, or a latency budget — not because a specific fact is worth keeping. This is what separates compression from the extract step in the write pipeline, which fires on every turn regardless of pressure.

What compression is not. It is not truncation: truncation drops messages whole, while compression rewrites them in a smaller form. It is not reflection: reflection generalizes across episodes to emit a higher-order claim, while compression preserves the episodes themselves in a shorter form, without necessarily generalizing. It is not chunking: chunking sets retrieval granularity at write time, while compression operates on what’s already in the buffer at runtime. Taken in the order truncation, reflection, compression, chunking, the four are the lossless (for whatever survives) / pattern / lossy / granularity primitives the memory subsystem operates with, and conflating them produces systems that nominally compress but actually do one of the other three.

Intuition

The mental model that pays off is log compaction with a structured key. Kafka’s log compaction is the canonical reference: a topic keeps the latest value per key, drops older values of the same key, and never drops messages without a key. The conversation analogue: each meaningful event in the buffer has an implicit key (the user’s task, the file being edited, the tool being called, the decision being made), and an old occurrence of the same key is superseded by a new one. A naive summarization pass that doesn’t know the keys produces a paragraph of text that reads like a human’s notes but loses the relational structure — the model can no longer answer “which file did we agree to skip?” because the answer has been blended into prose. A structured compaction pass that does know the keys preserves the key→latest-value mapping and discards everything else. The first reads beautifully and is operationally useless; the second reads like a config file and works.

Three design questions force themselves on every compression implementation. The first is what survives compression? Three families of answers: a prose summary (free-form natural language, high compression, low retrievability), a structured note (sections for intent / files / decisions / next-steps, medium compression, high retrievability), or a curated subset (drop some messages verbatim, keep others verbatim, no rewriting at all — Morph’s verbatim compaction). The second is when does compression fire? Three triggers: a watermark trigger (when the buffer crosses a threshold, e.g., 80% of the context window), a time trigger (every N minutes, every session-end), or a semantic trigger (when a topic shifts, when the model itself signals “let’s start fresh”). The third is how much survives? The compression ratio is the design variable that determines whether the compressed buffer is still useful — too aggressive and the next model call is confabulating from a too-thin summary, too gentle and you’ve spent the model call budget for no real reduction in working-set size.

The distributed-systems parallel

Three honest parallels, each load-bearing.

Log compaction is the closest match and the one every framework name-checks. Kafka’s compaction keeps the latest value per key and drops older ones; rollups in time-series databases keep aggregate statistics and drop raw points; database snapshots checkpoint state and discard the WAL prefix that’s been applied. The conversation buffer recapitulates all three: the latest-value-per-key pattern is what structured note-taking implements (the latest known file modification supersedes the older one), the aggregate-and-discard-raw pattern is what prose summarization implements (the gist survives, the verbatim turns don’t), and the checkpoint-and-truncate-WAL pattern is what Anthropic’s compaction beta implements (generate a summary block, drop all prior messages, continue from the summary). The trade-off between the three is the same trade-off Kafka users have spent a decade tuning: how much do you trust the key extraction relative to the raw stream?

The compression-vs-fidelity Pareto frontier is the same shape as image and video codec design. JPEG and H.264 settled this question years ago for visual media — the operator picks a quality level, the codec spends bits on the perceptually-load-bearing parts of the signal, and drops bits on the rest. Conversation compression is the same shape: the operator picks a compression ratio (or a model budget), the summarizer spends tokens on the load-bearing parts (file paths, error messages, decisions), and drops tokens on the rest (rephrased queries, model courtesies, retrieved-but-unused context). The cost of getting the perceptual model wrong in JPEG is artifacting; the cost in conversation compression is confabulation — the model fills in the gap with plausible-sounding nonsense. Both engineering problems converge on the same answer: measure what you preserve, not what you discard.

Materialized-view refresh policy ports directly from the reflection article. Reflection refreshes a belief table on a threshold; compression refreshes the message buffer itself on a threshold. The mechanics are the same: detect the threshold breach, run the expensive operation in the background or on the hot path, atomically swap the new representation in. The difference is the target: reflection materializes new entries in the long-term store; compression materializes a new working-set buffer that replaces the old one. The shared lesson: the refresh operation is where the cost lives, so the trigger policy is the dominant design variable. Anthropic’s beta lets you set the trigger anywhere from 50,000 input tokens up to whatever your model supports; setting it too low burns model calls on every short session, while setting it too high means you never actually compact and run out of context.

The four compression strategies that ship in production

The market has converged on four distinct strategies. Knowing them by name is what makes the framework choices downstream legible.

Strategy 1: Recursive (hierarchical) summarization

The seminal pattern, formalized in Wang et al.’s “Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models” (2023, journal version 2025). The mechanism: maintain a running summary; on every N turns, prompt the model to regenerate the summary by combining the previous summary with the new turns; the regenerated summary replaces the old one. The recursion is in the update step — each new summary is a function of (previous_summary, new_turns), and the previous_summary is itself the recursive result of all earlier compressions.

The pattern’s claim to fame is holistic regeneration. Hard append-or-delete summary updates fragment the summary over many cycles; a regeneration update produces a coherent rewrite each cycle. The cost is one full model call per update, with the summary token count growing slowly over time (each new event adds a few tokens to the summary, then the regeneration prunes the older now-less-relevant parts).
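
The update step can be sketched without any model in the loop. This is a minimal illustration, not the paper’s implementation: run_recursive_summary and toy_summarizer are hypothetical names, and the toy summarizer stands in for the model call that would regenerate the summary.

```python
from typing import Callable

# A summarizer takes (previous_summary, new_turns) and returns a new summary.
Summarizer = Callable[[str, list[str]], str]


def run_recursive_summary(
    turns: list[str], every_n: int, summarize: Summarizer
) -> str:
    """Regenerate the running summary after every `every_n` turns."""
    summary = ""
    pending: list[str] = []
    for turn in turns:
        pending.append(turn)
        if len(pending) == every_n:
            # Holistic regeneration: the new summary replaces the old one.
            summary = summarize(summary, pending)
            pending = []
    if pending:  # fold in any leftover turns
        summary = summarize(summary, pending)
    return summary


def toy_summarizer(previous: str, new_turns: list[str]) -> str:
    # Stand-in for a model call: keeps the first word of each new turn.
    gist = " ".join(t.split()[0] for t in new_turns)
    return f"{previous} | {gist}".strip(" |")


result = run_recursive_summary(
    ["alpha one", "beta two", "gamma three"], every_n=2, summarize=toy_summarizer
)
```

Each call sees only the previous summary and the new turns, never the full history, which is where both the cost savings and the drift come from.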

When it’s right. Long-running conversations where the gist matters more than the verbatim history. Personal AI companions, customer-service follow-ups, the kind of agent where “what did we agree to do?” is the dominant read query, and “what was the exact line you said?” is rare.

When it’s wrong. Coding agents, debugging agents, anywhere the verbatim content of an earlier turn is what the next call needs (file paths, error messages, exact configuration values). A recursive summary will paraphrase src/auth/handler.ts:142 into “the authentication handler” and the agent will spend the next ten turns re-reading the codebase to find what it once knew.

Strategy 2: Structured note-taking (anchored sections)

The 2026 production answer to recursive summarization’s paraphrasing problem. Productized by Factory.ai’s anchored iterative summarization and adopted in shape by every coding agent’s harness layer. The mechanism: define a fixed schema for the summary — sections for session_intent, files_modified, decisions_made, pending_questions, next_steps — and prompt the summarizer to populate each section explicitly. The schema is the structure-vs-prose distinction: a prose summary can silently drop a file path; a section labeled files_modified cannot, because the model has to either fill it or explicitly leave it empty.
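
One way to make “fill it or explicitly leave it empty” enforceable is to validate the summarizer’s output before accepting it. The sketch below assumes the five sections named above; validate_summary and REQUIRED_SECTIONS are hypothetical names, not part of Factory’s product.

```python
REQUIRED_SECTIONS: dict[str, type] = {
    "session_intent": str,
    "files_modified": list,
    "decisions_made": list,
    "pending_questions": list,
    "next_steps": list,
}


def validate_summary(summary: dict) -> list[str]:
    """Return a list of problems; an empty list means the summary is well-formed."""
    problems = []
    for section, expected in REQUIRED_SECTIONS.items():
        if section not in summary:
            problems.append(f"missing section: {section}")
        elif not isinstance(summary[section], expected):
            problems.append(f"{section} should be {expected.__name__}")
    return problems


good = {
    "session_intent": "Refactor the auth handler",
    "files_modified": ["src/auth/handler.ts"],
    "decisions_made": [],
    "pending_questions": [],
    "next_steps": ["run the test suite"],
}
bad = {"session_intent": "Refactor the auth handler", "decisions_made": "none"}

good_problems = validate_summary(good)
bad_problems = validate_summary(bad)
```

A harness could retry the summarization call, or keep the previous summary, whenever the problem list is non-empty.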

Factory’s published numbers from their evaluation on 36,000 production messages: structured summarization scored 3.70/5 overall on retention and accuracy, vs 3.44 for Anthropic’s automatic compaction and 3.35 for OpenAI’s /compact endpoint. The headline result they highlighted: technical-detail retention (file paths, error codes, line numbers) was meaningfully higher than for prose-only approaches.

When it’s right. Any agent where the load-bearing context is typed and enumerable — coding agents (files, lines, errors, decisions), data-analysis agents (datasets, columns, transformations), customer-support agents (issue, attempted fixes, escalations). The vast majority of production agents land here.

When it’s wrong. Open-ended conversational agents where the schema doesn’t pre-exist. A therapy-style companion or an open-domain Q&A doesn’t have natural section boundaries; forcing a schema produces awkward “Section: Intent — N/A” outputs that waste tokens. Recursive summarization or no-compression-with-aggressive-truncation is the better fit.

Strategy 3: Verbatim compaction (curated subset)

Morph’s verbatim compaction is the productized version: drop low-value tokens but keep high-value ones character-for-character. No rewriting, no paraphrasing, no model-generated text — the surviving content is byte-identical to the original input. The compression ratios are modest (50-70% vs the 80-90% of prose summarization) but the fidelity is perfect on the parts that survive. Reported numbers: 3,300+ tokens/sec and 98% verbatim accuracy.

The trade-off is in the deletion oracle. The pass needs to decide which tokens are low-value, and that decision is itself a model call — typically a smaller model classifying each message as “drop” or “keep.” The pass is fast in aggregate because the per-message decision is cheap, but the model has to be calibrated; a bad oracle deletes the wrong half.
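
A minimal sketch of the keep-or-drop shape, assuming message-level decisions. The oracle here is a toy heuristic standing in for the small classifier model; verbatim_compact and toy_oracle are hypothetical names and say nothing about how Morph implements its pass.

```python
from typing import Callable

Message = dict[str, str]                # {"role": ..., "content": ...}
KeepOracle = Callable[[Message], bool]  # True means "keep this message"


def verbatim_compact(messages: list[Message], keep: KeepOracle) -> list[Message]:
    """Drop low-value messages; survivors are returned unmodified."""
    return [m for m in messages if keep(m)]


def toy_oracle(message: Message) -> bool:
    # Stand-in for a small classifier: keep user turns and anything that
    # mentions a path-like or error-like token.
    text = message["content"]
    return message["role"] == "user" or "/" in text or "Error" in text


messages = [
    {"role": "user", "content": "Fix the login bug"},
    {"role": "assistant", "content": "Sure, happy to help with that!"},
    {"role": "assistant", "content": "TypeError at src/auth/handler.ts:142"},
]
kept = verbatim_compact(messages, toy_oracle)
```

Because nothing is rewritten, the only way this pass can hurt is by dropping the wrong message, which is why the oracle’s calibration carries all the risk.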

When it’s right. Coding agents (the dominant use case Morph targets), debugging sessions, anywhere the agent will re-reference specific lines from the conversation later. The verbatim-preservation property is what makes the “re-reading loop” go away — the agent doesn’t have to re-discover what was already established.

When it’s wrong. Conversations dominated by long-form natural language where the gist matters more than any individual sentence. A long meeting transcript compressed by verbatim deletion still produces a long, awkward sequence of disconnected verbatim passages; a prose summary of the same transcript is much more useful as a working-set substitute.

Strategy 4: Opaque (model-internal) compression

OpenAI’s /v1/conversations/{id}/compact endpoint and similar opaque-compression interfaces produce a server-side compressed representation that the client never sees in human-readable form. The compression ratios are dramatic — Morph’s reporting cites 99.3% for OpenAI — because the representation isn’t bound by being human-legible. The compressed state is then attached to the next request as an opaque token.

When it’s right. When you don’t need to inspect what was kept and dropped, and you trust the vendor’s compression model. Extremely long contexts where the gain from 99% compression dominates the loss from inspection.

When it’s wrong. Auditable systems, multi-vendor portability, debugging-sensitive workflows. Opaque compression is the vendor-lock-in version of the trade-off — the operator gives up inspection and portability in exchange for raw compression ratio. Most production teams that have evaluated it (Factory’s published score: 3.35/5) end up preferring structured summarization for the inspectability.

Anthropic’s compaction beta — the API-level pattern

The cleanest reference implementation in 2026 is Anthropic’s compaction strategy. The strategy name is compact_20260112; the beta header is compact-2026-01-12; the trigger defaults to 150,000 input tokens. When the threshold is breached, the API runs the compaction inline, emits a compaction block in the response, and on the next request automatically drops all message blocks prior to the compaction block. The pattern is straightforward log-compaction with a server-managed checkpoint, exposed to the client as a single header and a configuration object. Custom summarization instructions can be passed in instructions to override the default prompt.

The shape worth internalizing — and the reason Anthropic ships this as a beta strategy rather than letting the client implement it — is that the server-managed checkpoint eliminates the client’s risk of getting the boundary wrong. A client-side compaction has to maintain its own “compacted prefix” pointer; if the pointer drifts (or the client retries a request and the retry runs on the un-compacted version), the model sees two copies of the same content or loses the compaction entirely. The server-managed version eliminates that bug class.
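
To make the checkpoint semantics concrete, here is a sketch of “drop everything prior to the compaction block” on a plain list. The block shape (a dict with a "type" field whose value is "compaction") is a simplified assumption for illustration, not Anthropic’s wire format, and with the server-managed beta the client does not perform this step itself.

```python
def truncate_at_checkpoint(blocks: list[dict]) -> list[dict]:
    """Keep the most recent compaction block and everything after it."""
    last = None
    for i, block in enumerate(blocks):
        if block.get("type") == "compaction":
            last = i
    # With no checkpoint present, the history is returned unchanged.
    return blocks if last is None else blocks[last:]


history = [
    {"type": "text", "text": "old turn 1"},
    {"type": "text", "text": "old turn 2"},
    {"type": "compaction", "content": "summary of turns 1-2"},
    {"type": "text", "text": "new turn"},
]
live = truncate_at_checkpoint(history)
```

The function is idempotent: running it twice, as a retried request would, yields the same live set, which is the property a drifting client-side pointer fails to guarantee.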

The mechanics: when does compaction fire?

The single biggest cost lever in compression is when it runs.

Reactive (watermark-triggered). The harness counts tokens before each turn; if the count exceeds a watermark, it runs the compaction pass before the next model call. Pros: only runs when needed, minimal wasted model calls. Cons: the user-facing turn that triggers the watermark pays the compaction latency (typically 1-5 seconds with a small model). This is the default in the SummarizationNode that LangMem ships for LangGraph (it compresses when max_tokens_before_summary is exceeded) and in Anthropic’s beta.

Preemptive (budget-aware). Before each turn, predict whether the current turn will breach the budget after the model’s reply, and if so, run the compaction before the turn rather than after. Requires a token-count estimate for the model’s output (use a conservative estimate: the model’s max_tokens parameter, or a 95th-percentile output length). Pros: the user-facing turn never gets surprise-latency from compaction. Cons: extra model calls when the prediction is wrong (compaction runs but wasn’t actually needed). Best for high-stakes interactive agents where the unexpected latency is more expensive than the occasional wasted compaction.
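
The preemptive check reduces to one line of arithmetic. A minimal sketch, assuming the harness already has token counts for the buffer and the incoming message; the function and parameter names are hypothetical.

```python
def should_compact_preemptively(
    buffer_tokens: int,
    next_input_tokens: int,
    max_output_tokens: int,
    budget_tokens: int,
) -> bool:
    """True if the coming turn could push the buffer past the budget."""
    # Conservative projection: assume the reply uses its full max_tokens.
    projected = buffer_tokens + next_input_tokens + max_output_tokens
    return projected > budget_tokens


fits = should_compact_preemptively(150_000, 2_000, 8_000, 160_000)
breaches = should_compact_preemptively(155_000, 2_000, 8_000, 160_000)
```

Swapping max_output_tokens for a 95th-percentile output length makes the check less conservative and reduces wasted compactions, at the cost of an occasional mid-turn breach.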

Deferred (session-end or background). Compaction runs at session close or in a sleep-time / background pass over recorded sessions. Best when the conversations are bounded and the compressed state is consumed by future sessions, not the current one. The classic use case is multi-session memory: the live session uses the raw buffer; the cross-session memory layer pulls the compacted version. This is the pattern LangMem’s summarize_messages is designed for.

Semantic (topic-shift-triggered). Compress when the topic shifts: when a tool’s output indicates a task transition, when the user introduces a new objective, when the agent itself signals “starting fresh.” The trigger is content-aware rather than budget-aware. Pros: compaction lands at natural boundaries; the compressed summary is a coherent block. Cons: requires a topic-detection signal that itself costs something. Useful as a secondary trigger on top of a watermark: if the watermark fires and a topic shift is detected, compress; if only the topic shift fires, hold; if only the watermark fires, hold for a coherent boundary only while there is still headroom below the context limit.

The defensible production answer is watermark as primary, semantic as secondary: maintain a token watermark at 70-80% of the model’s effective context, fire compaction when crossed, and use semantic signals to nudge the boundary to a coherent point. The conversation-compaction article works the harness-level orchestration design — reactive vs preemptive triggers, cache-aware deletion, circuit breakers, snapshot-and-rollback — in more detail; the takeaway here is that compression is a budget-driven operation, and the trigger is the dominant lever.
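
The watermark-primary, semantic-secondary policy can be written as a small decision function. This is a sketch under assumptions: the 0.75 soft watermark sits inside the 70-80% range above, while the 0.90 hard ceiling is an added assumption so that waiting for a topic shift can never exhaust the window. Action and decide are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    HOLD = "hold"
    COMPACT = "compact"


def decide(
    tokens: int,
    window: int,
    topic_shift: bool,
    soft: float = 0.75,  # watermark, within the 70-80% range
    hard: float = 0.90,  # assumed ceiling: compact regardless of topic
) -> Action:
    fill = tokens / window
    if fill >= hard:
        return Action.COMPACT  # out of headroom, boundary quality is secondary
    if fill >= soft and topic_shift:
        return Action.COMPACT  # past the watermark at a coherent boundary
    return Action.HOLD


at_watermark_no_shift = decide(150_000, 200_000, topic_shift=False)
at_watermark_with_shift = decide(150_000, 200_000, topic_shift=True)
near_limit = decide(185_000, 200_000, topic_shift=False)
```

Between the two thresholds the semantic signal picks the moment; above the ceiling the budget wins.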

The runnable implementations

Both implementations realize the same recursive-summarization-with-structured-output pattern: a running summary stored separately from the buffer, a sliding-window tail of the most recent messages kept verbatim, and a structured-section prompt that forces the summarizer to populate each load-bearing slot explicitly. The schema follows Factory’s set (intent, files_modified, decisions_made, pending_questions, next_steps); five sections is a defensible starting point, and the pattern scales to richer schemas.

Python — recursive structured-section compaction

Uses the Anthropic SDK for the summarization call. The implementation is hot-path-blocking but the same shape ports to a background queue.

python
# pip install "anthropic>=0.40.0"
import json
from typing import Any
from anthropic import Anthropic

client = Anthropic()
SUMMARIZER_MODEL = "claude-haiku-4-5"  # $1/$5 per million tokens

SUMMARY_PROMPT = """\
You are compacting an agent conversation. Read the prior summary (if any) and the new messages, then emit an updated structured summary with EXACTLY these JSON keys. Every key must be present. Use [] for empty lists, "" for empty strings, never omit a key. Never paraphrase technical identifiers (file paths, error codes, function names) — copy them verbatim from the source.

{{
  "session_intent": str,            // one sentence: the user's overall goal
  "files_modified": [str],          // list of file paths touched, in order
  "decisions_made": [str],          // bulleted decisions with rationale
  "pending_questions": [str],       // unresolved questions from the user
  "next_steps": [str]               // what the agent plans to do next
}}

PRIOR SUMMARY:
{prior_summary}

NEW MESSAGES:
{new_messages}

Respond with the JSON object only, no prose.
"""

def render_messages(messages: list[dict]) -> str:
    return "\n\n".join(
        f"[{m['role'].upper()}] {m['content']}" for m in messages
    )

def compact(prior_summary: dict | None, new_messages: list[dict]) -> dict:
    prior_text = json.dumps(prior_summary or {
        "session_intent": "", "files_modified": [],
        "decisions_made": [], "pending_questions": [], "next_steps": [],
    }, indent=2)
    prompt = SUMMARY_PROMPT.format(
        prior_summary=prior_text,
        new_messages=render_messages(new_messages),
    )
    resp = client.messages.create(
        model=SUMMARIZER_MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
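    text = resp.content[0].text.strip()
    # Strip code fences if the model wrapped the JSON
    if text.startswith("```"):
        # removeprefix drops the literal "json" tag; lstrip("json\n") strips a
        # character set and only works because JSON objects start with "{"
        text = text.split("```")[1].removeprefix("json").strip()
    return json.loads(text)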

# ----- The buffer manager that uses it -----
TAIL_KEEP = 8         # last N messages kept verbatim
WATERMARK_TOKENS = 100_000  # compress when this is exceeded

class CompactingBuffer:
    def __init__(self) -> None:
        self.system_prompt: dict | None = None
        self.summary: dict | None = None
        self.tail: list[dict] = []   # uncompacted recent messages

    def append(self, message: dict) -> None:
        self.tail.append(message)
        if self._approx_tokens() > WATERMARK_TOKENS:
            self._compact()

    def _approx_tokens(self) -> int:
        # ~4 chars per token is a defensible rough estimate
        body = json.dumps(self.summary or {}) + render_messages(self.tail)
        return len(body) // 4

    def _compact(self) -> None:
        # Compress everything except the last TAIL_KEEP messages
        if len(self.tail) <= TAIL_KEEP:
            return
        to_compress = self.tail[:-TAIL_KEEP]
        self.summary = compact(self.summary, to_compress)
        self.tail = self.tail[-TAIL_KEEP:]

    def render_for_model(self) -> list[dict]:
        msgs: list[dict] = []
        if self.system_prompt:
            msgs.append(self.system_prompt)
        if self.summary:
            msgs.append({
                "role": "user",
                "content": (
                    "PRIOR SESSION SUMMARY (structured):\n"
                    + json.dumps(self.summary, indent=2)
                    + "\n\nContinue the conversation below."
                ),
            })
        msgs.extend(self.tail)
        return msgs

A few decisions worth noting. The summarizer is a small model (Claude Haiku 4.5 at $1/$5 per million tokens); using the foreground agent’s model for compaction is the most common cost bug in production harnesses. The watermark check is on every append, but the compaction only runs when the buffer actually exceeds the threshold and there are enough tail messages to compress. The summary is rendered into the next prompt as a user message containing structured JSON, not as a system-prompt mutation — keeping the system prompt static preserves the prompt cache prefix, which is the second-most-common cost bug.

TypeScript — same shape with the Vercel AI SDK

Uses the Vercel AI SDK (ai package) with the Anthropic provider for portability.

typescript
// npm install ai @ai-sdk/anthropic zod
import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

const SummarySchema = z.object({
  session_intent: z.string(),
  files_modified: z.array(z.string()),
  decisions_made: z.array(z.string()),
  pending_questions: z.array(z.string()),
  next_steps: z.array(z.string()),
});
type Summary = z.infer<typeof SummarySchema>;

const SUMMARIZER = anthropic("claude-haiku-4-5");
const TAIL_KEEP = 8;
const WATERMARK_TOKENS = 100_000;

type Message = { role: "user" | "assistant" | "system"; content: string };

const renderMessages = (msgs: Message[]) =>
  msgs.map((m) => `[${m.role.toUpperCase()}] ${m.content}`).join("\n\n");

async function compact(
  prior: Summary | null,
  newMessages: Message[],
): Promise<Summary> {
  const priorText = JSON.stringify(
    prior ?? {
      session_intent: "",
      files_modified: [],
      decisions_made: [],
      pending_questions: [],
      next_steps: [],
    },
    null,
    2,
  );
  const prompt = `You are compacting an agent conversation. Update the structured summary by combining the PRIOR SUMMARY with the NEW MESSAGES. Never paraphrase technical identifiers (file paths, error codes, function names) — copy them verbatim.

PRIOR SUMMARY:
${priorText}

NEW MESSAGES:
${renderMessages(newMessages)}`;

  const { object } = await generateObject({
    model: SUMMARIZER,
    schema: SummarySchema,
    prompt,
  });
  return object;
}

export class CompactingBuffer {
  systemPrompt: Message | null = null;
  summary: Summary | null = null;
  tail: Message[] = [];

  async append(message: Message): Promise<void> {
    this.tail.push(message);
    if (this.approxTokens() > WATERMARK_TOKENS) {
      await this.compact();
    }
  }

  private approxTokens(): number {
    const body =
      JSON.stringify(this.summary ?? {}) + renderMessages(this.tail);
    return Math.ceil(body.length / 4);
  }

  private async compact(): Promise<void> {
    if (this.tail.length <= TAIL_KEEP) return;
    const toCompress = this.tail.slice(0, -TAIL_KEEP);
    this.summary = await compact(this.summary, toCompress);
    this.tail = this.tail.slice(-TAIL_KEEP);
  }

  render(): Message[] {
    const msgs: Message[] = [];
    if (this.systemPrompt) msgs.push(this.systemPrompt);
    if (this.summary) {
      msgs.push({
        role: "user",
        content:
          "PRIOR SESSION SUMMARY (structured):\n" +
          JSON.stringify(this.summary, null, 2) +
          "\n\nContinue the conversation below.",
      });
    }
    return msgs.concat(this.tail);
  }
}

Same shape, same trade-offs. The TypeScript version uses generateObject from the Vercel AI SDK to bind the schema directly to the call; the Anthropic SDK version achieves the same constraint via the prompt-anchored JSON contract. Both rely on the same insight: the structured output is what forces the summarizer to populate each load-bearing section, and the structured-output article’s pattern of schema-coerced extraction ports over without modification.

Measuring compression-induced quality loss

Compression is the operation that’s most-often shipped untested, and the failure mode is silent. Three diagnostics worth running before any compression strategy goes to production.

The probe test. Before compression, the buffer contains a set of probes — verifiable facts that the next call could plausibly need (file paths, decisions, configuration values, user-stated preferences). After compression, query the resulting buffer (via a small model call) and check whether each probe is recoverable. The probe-recovery rate is the headline metric. Factory’s published evaluation uses probe-based scoring across four dimensions (factual retention, file tracking, task planning, reasoning chains); the framework is general and worth porting to any custom summarizer.
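
A cheap first version of the probe test swaps the small-model query for a verbatim substring check. That substitution is deliberate and limited: it only works for identifier-like probes (paths, config values) and gives a lower bound, since a paraphrased fact counts as lost. The function name and the sample data are hypothetical.

```python
def probe_recovery_rate(probes: list[str], compressed_buffer: str) -> float:
    """Fraction of probes that appear verbatim in the compressed buffer."""
    if not probes:
        return 1.0
    recovered = sum(1 for p in probes if p in compressed_buffer)
    return recovered / len(probes)


probes = ["src/auth/handler.ts", "JWT_EXPIRY=3600", "skip tests/legacy"]
summary = "Implemented auth refactor in src/auth/handler.ts; set JWT_EXPIRY=3600."
rate = probe_recovery_rate(probes, summary)  # 2 of the 3 probes survive
```

For free-form facts such as user preferences, the check inside the loop would need to become the model-graded query described above.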

The re-reading-loop diagnostic. After compression, count how many turns it takes the agent to re-discover facts that were in the pre-compression buffer. If the agent calls Read on a file it has already edited, or asks the user for a clarification it has already received, the compression dropped a load-bearing piece. The instrumentation is cheap (compare post-compression tool calls to pre-compression tool calls for the same agent on the same task) and the signal is direct. Morph’s pitch for verbatim compaction is grounded in this metric: a re-reading loop is what you avoid when you preserve byte-for-byte content.
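
The instrumentation can be as small as a set lookup over logged tool calls. A sketch, assuming calls are logged as (tool_name, path) pairs; the Edit tool name and the log shape are assumptions for illustration.

```python
def count_rereads(
    pre_calls: list[tuple[str, str]],
    post_calls: list[tuple[str, str]],
) -> int:
    """Count post-compaction Read calls on files already touched before it."""
    already_known = {path for tool, path in pre_calls if tool in {"Read", "Edit"}}
    return sum(
        1 for tool, path in post_calls
        if tool == "Read" and path in already_known
    )


pre = [
    ("Read", "src/auth/handler.ts"),
    ("Edit", "src/auth/handler.ts"),
    ("Read", "src/db.ts"),
]
post = [
    ("Read", "src/auth/handler.ts"),  # re-read of a file the agent already edited
    ("Read", "src/new.ts"),           # genuinely new file
    ("Edit", "src/db.ts"),            # an edit, not a re-read
]
rereads = count_rereads(pre, post)
```

Not every re-read is a failure (files change on disk), so the count is best read as a trend across sessions rather than as a per-call verdict.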

The confabulation diagnostic. Run two versions of the next call — one with the compressed buffer, one with the uncompressed buffer — and compare the responses. Disagreements on the facts (which file, what value, which decision) are confabulation; disagreements on the style are acceptable. The diff-based diagnostic is what catches the failure mode where the agent confidently says the wrong thing because the summary said the wrong thing. Run it offline on logged sessions; the cost is one extra model call per probe session, and the signal is high.
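
A rough way to separate factual from stylistic disagreement is to diff only the identifiers each reply asserts. The sketch below assumes path-like identifiers are the facts that matter; the regular expression is a crude assumption and will miss decisions or values that are not path-shaped.

```python
import re

# Rough pattern for path-like identifiers with an optional :line suffix.
IDENTIFIER = re.compile(r"[\w./-]+\.\w+(?::\d+)?")


def factual_disagreements(full_reply: str, compressed_reply: str) -> set[str]:
    """Identifiers asserted by one reply but not the other."""
    full = set(IDENTIFIER.findall(full_reply))
    compressed = set(IDENTIFIER.findall(compressed_reply))
    return full ^ compressed


with_full_buffer = "I will edit src/auth/handler.ts:142 next."
with_compressed_buffer = "I will edit src/auth/login.ts next."
diff = factual_disagreements(with_full_buffer, with_compressed_buffer)
```

An empty set means the two replies name the same identifiers, whatever their wording; a non-empty set is a candidate confabulation worth a manual look.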

These three together — probe-recovery, re-reading-loop, confabulation diff — are the closest the compression layer comes to having a unit-test discipline. The RAG evaluation article covers the broader testing framework for memory-side retrieval; the same pattern (golden set + probe queries + scored outputs) ports over.

Trade-offs, failure modes, and gotchas

The cache-invalidation cost. Every compression operation rewrites the buffer, which invalidates the prompt cache prefix from the compacted boundary onward. A naive harness that compresses on every turn pays the full prefill cost on every subsequent call. The mitigation is to bias compaction toward the head of the buffer: compress the older content, preserve the recent tail and the system prompt unchanged. The summary block, written once, then stays cache-stable for many turns. The cache-aware-compaction pattern — minimize the cache-invalidation surface area of the rewrite — is the load-bearing optimization in any production compaction layer.
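
The invalidation surface can be estimated by measuring how much of the message array is unchanged between consecutive calls. This sketch uses message-level equality as a proxy; real prompt caches work on tokens and provider-specific breakpoints, so treat the number as an approximation. shared_prefix_len is a hypothetical helper.

```python
def shared_prefix_len(before: list[dict], after: list[dict]) -> int:
    """Number of leading messages that are identical in both arrays."""
    n = 0
    for old, new in zip(before, after):
        if old != new:
            break
        n += 1
    return n


system = {"role": "system", "content": "You are a coding agent."}
summary = {"role": "user", "content": "PRIOR SESSION SUMMARY: ..."}
t1 = {"role": "user", "content": "rename the helper"}
t2 = {"role": "assistant", "content": "done"}
t3 = {"role": "user", "content": "now run the tests"}

before = [system, summary, t1, t2]
after_append = before + [t3]                       # normal turn
after_mutation = [                                 # summary folded into system
    {"role": "system", "content": system["content"] + " Summary: ..."},
    t1, t2, t3,
]

stable = shared_prefix_len(before, after_append)
unstable = shared_prefix_len(before, after_mutation)
```

A harness can log this number per call; a drop to zero on turns where no compaction was scheduled is a sign that something is rewriting the head of the buffer.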

The paraphrasing-of-identifiers bug. The most common quality bug. A summarizer asked to “summarize the conversation” will paraphrase src/auth/handler.ts:142 as “the authentication handler at line 142” or, worse, as “the authentication handler”; the agent then loses navigation. The mitigation is in the prompt — the explicit instruction “never paraphrase file paths, error codes, function names, or version numbers; copy them verbatim from the source” — combined with a structured schema where the identifier-bearing fields (files_modified, error_codes_seen) are typed as arrays of literal strings. Without both, the summarizer drifts to natural-language fluency at the cost of verbatim accuracy.

The compounding-error cliff. Recursive summarization regenerates the summary on each cycle; each regeneration is one more chance for the model to drop a fact, paraphrase an identifier, or misattribute a decision. By cycle ten, the summary is several generations removed from the original turns, and the error rate has compounded. The mitigation is to keep the raw source spans available in a side store and re-summarize from raw periodically — typically every K cycles, re-generate the summary from the original raw turns rather than from the previous summary. This breaks the compounding chain at the cost of one expensive re-summarization every K cycles.
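
The periodic rebuild can be sketched as a counter around the recursive update. Assumptions: the raw turns fit in a side store, the summarizer is passed in as a function of (previous_summary, turns), and RebuildingSummary is a hypothetical name. A real rebuild over a very long session would also need to chunk the raw turns, which is not shown.

```python
from typing import Callable

Summarizer = Callable[[str, list[str]], str]


class RebuildingSummary:
    def __init__(self, summarize: Summarizer, rebuild_every: int) -> None:
        self.summarize = summarize
        self.rebuild_every = rebuild_every  # K in the text above
        self.raw: list[str] = []            # side store of original turns
        self.summary = ""
        self.cycles = 0

    def update(self, new_turns: list[str]) -> str:
        self.raw.extend(new_turns)
        self.cycles += 1
        if self.cycles % self.rebuild_every == 0:
            # Break the chain: summarize from raw, ignoring the old summary.
            self.summary = self.summarize("", self.raw)
        else:
            self.summary = self.summarize(self.summary, new_turns)
        return self.summary


calls: list[tuple[str, int]] = []


def recording_summarizer(previous: str, turns: list[str]) -> str:
    # Stand-in for a model call that records what it was given.
    calls.append((previous, len(turns)))
    return f"summary of {len(turns)} turns"


tracker = RebuildingSummary(recording_summarizer, rebuild_every=3)
for turn in ["a", "b", "c"]:
    tracker.update([turn])
```

On the third cycle the summarizer receives an empty previous summary and all three raw turns, so any error introduced in cycles one and two is discarded rather than inherited.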

The compression-induced amnesia confusion. A user references a fact from earlier in the session (“the bug I told you about at the start”); the agent doesn’t have it in the buffer and asks for clarification. The user, having said it once already, is annoyed. The mitigation is to expose the compression state to the user — UI-side indicators like “I compressed our earlier conversation; let me re-load that detail” make the failure mode legible rather than silent. The OpenAI Agents SDK’s OpenAIResponsesCompactionSession exposes the compression hooks for exactly this reason.

The tool-call-tool-result orphan. A compression pass that drops a tool call but keeps its tool result (or vice versa) leaves the model with an orphan message; modern APIs reject the malformed sequence. The fix is the same as short-term memory’s tool-call/tool-result pairing invariant: treat the pair as atomic at compression time. The structured summary should describe the pair as a unit (“ran Read('src/auth.ts') and obtained the file contents”) rather than trying to keep one without the other.
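
Treating the pair as atomic mostly comes down to choosing a compaction boundary that never separates a call from its result. The sketch below assumes a simplified, hypothetical message shape in which both the call and the result carry the same `tool_call_id` key; real provider formats differ in detail.

```python
def safe_cut(messages: list[dict], cut: int) -> int:
    """Move `cut` earlier until no tool call/result pair straddles the boundary."""
    cut = max(0, min(cut, len(messages)))
    while cut > 0:
        before = {m["tool_call_id"] for m in messages[:cut] if "tool_call_id" in m}
        after = {m["tool_call_id"] for m in messages[cut:] if "tool_call_id" in m}
        if not (before & after):
            break  # no id appears on both sides, so no pair is split
        cut -= 1
    return cut
```

Everything before the returned index can be summarized and everything from it onward kept, with each pair falling wholly on one side.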

The summary-as-system-prompt anti-pattern. Some harnesses fold the running summary into the system prompt, mutating it every cycle. This is two failure modes at once: it invalidates the prompt cache (the system prompt is the most-cached prefix), and it conflates instructions with state. The defensible pattern is to keep the system prompt immutable and place the summary as a separate user-role message immediately after — preserving the cache and keeping the two abstractions separate.

The Goodhart-on-compression-ratio failure. Optimizing for compression ratio (tokens saved) without measuring quality loss is the canonical Goodhart’s-law trap: the operator tunes the summarizer to compress more aggressively, the compression ratio improves, and the agent silently degrades. The mitigation is to always measure the ratio jointly with a quality metric (probe recovery or downstream task success) and reject any compression configuration that improves ratio while degrading quality. Compression is one of those operations where the cheap-and-obvious metric is the wrong one to optimize for. The same discipline applies to the long-horizon MOP-restart pattern, which is built on a compaction pass: a meltdown recovery that lossy-compresses the wrong fact reseeds the same hallucination cascade with a clean context window in front of it.
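
The joint measurement can be expressed as a small acceptance gate. This is a sketch under stated assumptions: `CompressionEval`, `accept`, and the `tolerance` parameter are hypothetical names, and `probe_recovery` is assumed to be the fraction of probe questions answered correctly from the compressed context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressionEval:
    tokens_before: int
    tokens_after: int
    probe_recovery: float  # assumed: fraction of probes answered correctly, 0.0 to 1.0

    @property
    def ratio(self) -> float:
        return self.tokens_before / self.tokens_after


def accept(candidate: CompressionEval, baseline: CompressionEval,
           tolerance: float = 0.0) -> bool:
    """Accept a configuration only if quality holds; ratio alone never qualifies it."""
    if candidate.probe_recovery < baseline.probe_recovery - tolerance:
        return False  # quality dropped, so any ratio gain is rejected
    return candidate.ratio >= baseline.ratio
```

Under this gate, a configuration that doubles the ratio while losing probe recovery is rejected, which is the case the ratio-only view would have rewarded.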

The blind-compaction-of-instructions bug. A user mid-session says “from now on, never use TypeScript enums.” The instruction has the shape of a turn, gets compressed into the summary, gets paraphrased as “user has style preferences about TypeScript.” Three turns later the agent emits a TypeScript enum. The fix is to route durable instructions to a separate channel — extract them at write time (the write-policies pattern), pin them to the system prompt or a system-prompt-adjacent block, and never let them be lossy-compressed alongside conversation. Compression is for content; durable instructions are facts and belong in the write-policy-managed tier, not in the summary.
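
The routing step can be sketched as follows. The keyword pattern here is a deliberately crude stand-in for whatever classifier the write policy actually uses (in practice this is likely a model call), and `DURABLE`, `route_turn`, `pinned`, and `buffer` are hypothetical names.

```python
import re

# Crude, assumed heuristic for "durable instruction"; a real write policy
# would likely use a model-based classifier instead.
DURABLE = re.compile(r"\b(from now on|always|never)\b", re.IGNORECASE)


def route_turn(text: str, pinned: list[str], buffer: list[str]) -> None:
    """Copy durable instructions to the pinned block; every turn still enters the buffer."""
    if DURABLE.search(text):
        pinned.append(text)  # kept verbatim and never passed to the summarizer
    buffer.append(text)      # the conversational copy may be compressed freely
```

The pinned list is what gets rendered into the system-prompt-adjacent block, so the instruction survives even if the summary paraphrases the turn it came from.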

When compression beats other strategies — and when it doesn’t

Compression wins when the conversation contains a long mid-section of low-value turns surrounded by high-value endpoints. Coding sessions with extensive tool-call cycles, debugging sessions with many false starts, exploratory analysis sessions — all have the shape “the recent tail and the original task are load-bearing, the middle is mostly retrieved-but-unused context.” A summary of the middle preserves the relevant facts at a fraction of the token cost, and the working set fits comfortably.

Compression wins when the next call’s working set is a small subset of the full history. If the agent only needs to remember “what we decided” and “where we are now,” the full history is overkill. A small structured summary substitutes for the history at one or two orders of magnitude lower token cost.

Truncation beats compression for short sessions. If the conversation is going to end in another 5-10 turns regardless, the cost of a compression pass (a model call, plus the cache invalidation, plus the latency) outweighs the cost of just truncating. Compression is a long-session strategy; in a short session, simple head-tail truncation wins.

Verbatim compaction beats summary compression when verbatim content is load-bearing. Coding agents are the canonical example: file paths, line numbers, error codes, and function names are only useful to the agent if they survive byte for byte. Many teams running coding agents in production have converged on verbatim or structured compaction for this reason; the prose-summary approach is the simpler-and-prettier loser.

Long-context models reduce the urgency, but don’t eliminate it. Some models now ship with 1M-token contexts, so the immediate budget pressure for compression is lower. But the cost per call still grows with context size, and the lost-in-the-middle effect means an over-stuffed long context can perform worse than a well-curated short one. Compression remains the right operation even when the hard limit is no longer binding: the soft limit (attention dilution, cost per call, latency) shows up sooner than the hard one.

Further reading

  • Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models — Wang et al., 2023 (journal version 2025) — the canonical reference for the recursive-summarization pattern. The mechanism (regenerate the entire summary holistically on each update, rather than appending or deleting) is what every production summarizer either ports or knowingly diverges from. The journal version’s Neurocomputing publication extends the analysis with multi-domain dialogue evaluation.
  • Evaluating Context Compression for AI Agents — Factory.ai, 2026 — the structured-summarization comparison with Anthropic’s and OpenAI’s strategies on 36,000 production messages. The probe-based evaluation framework is the most rigorous public methodology for measuring compression-induced quality loss, and the anchored-iterative-summarization pattern is the production blueprint for any coding-agent harness.
  • Compaction vs Summarization — Morph LLM, 2026 — the verbatim-compaction perspective on the same problem. The argument for byte-for-byte preservation over prose rewriting is well-articulated and the trade-off frame (compression ratio vs fidelity vs inspectability) is the right one to internalize before picking a strategy.
  • Anthropic Compaction Documentation — the API-level reference for compact_20260112. Worth reading alongside the context-editing docs — the two are complementary primitives (compaction summarizes; context-editing surgically clears tool results) and a defensible harness uses both.
  • LangMem Summarization Guide — the framework-level patterns for summarize_messages and SummarizationNode. The running_summary state propagation is the load-bearing detail that distinguishes a real recursive summarizer from a naive “summarize-from-scratch-every-time” implementation.
  • Conversation Compaction: Keeping Long Sessions Alive — the sibling article on the harness orchestration of compaction in a long-running session. This piece is the mechanics of summarization; that one is the when, how-safely, and what-to-do-when-it-fails — reactive vs preemptive triggers, cache-aware surgical deletion, circuit breakers, snapshot-and-rollback, and append-only memory journals as the architectural alternative.
  • Reflection: From Experiences to Beliefs — the sibling maintenance operation. Reflection generalizes across episodes to emit higher-order beliefs; compression preserves the episodes themselves in a shorter form. Read the two together to see the full maintenance axis of the memory subsystem.
  • Memory Write Policies: What’s Worth Remembering — the upstream gate. Distillation at write time reduces the volume of content that ever needs compressing; a tight write policy is the first line of defense against unbounded buffer growth.
  • Sleep-Time Compute and Memory Consolidation — the regime compression should run in. Compaction is the heaviest operation in the memory subsystem; running it on the hot path is rarely defensible. Sleep-time compute is the architectural answer — idle-time scheduling, cheap-model consolidation, and the cost math that turns deferred compression from “a v2 nice-to-have” into the default production pattern.