Conversation Compaction: Keeping Long Sessions Alive
Conversation compaction in long agent sessions: reactive vs preemptive triggers, cache-aware deletion, circuit breakers, snapshot-rollback, journals.
A coding agent has been running for six hours on a database migration. Token count: 740,000 of an 800,000-token budget. The harness counts tokens before every call, but the counter has been quietly off by 2% because the tokenizer shipped in the SDK lags the production tokenizer by one version. The agent emits a long tool-result; the actual input exceeds the model’s hard limit; the call returns context_length_exceeded. The harness’s compaction pass fires reactively, sends a 740k-token request to a small summarizer, also gets context_length_exceeded because the summarizer’s window is smaller than the foreground model’s, and the session is wedged: it cannot proceed without compacting, and it cannot compact without proceeding. On-call gets paged at 3am. The lesson is not “we should have compacted earlier” — it is that compaction is the only operation in a long agent session that, when it fails, leaves no path forward, and so its orchestration is the most safety-critical surface in the harness.
Opening bridge
Yesterday’s piece on the agent harness named seven duties and noted that “a summarizer that rewrites the prefix invalidates every downstream cache.” That sentence was a placeholder for an engineering surface; today we open it. The context-compression article covered the mechanics of summarization — recursive, structured, verbatim, opaque — and how to measure compression-induced quality loss. This piece is the complementary half: harness-level orchestration of when to fire compaction in a long session, how to fire it without breaking the prompt cache, what to do when it fails, and whether you should be compacting at all versus running an append-only memory journal that sidesteps the operation. The two together are the full conversation-compaction picture; this one closes the Agents subtree.
Definition
Conversation compaction is the harness operation that, when triggered, reduces the size of the live conversation buffer in place so that the next model call fits. Three properties distinguish conversation compaction in a long session from the broader compression operation. It is in-place mutation of the live buffer, not an offline pass — the next call must succeed against the new buffer. It is non-optional once triggered: there is no path forward if it fails. And it is cache-disruptive by default: any rewrite invalidates the prompt-cache prefix from the rewrite boundary onward, so a naive implementation pays full prefill on every subsequent call. The orchestration question is how to fire it as rarely and cheaply as possible, and how to keep the session alive when it fails.
The distributed-systems parallels
Three load-bearing parallels, each different from the parallels in the compression article. That article drew log-compaction-as-mechanism; this one draws log-compaction-as-orchestration, generational garbage collection, and circuit breakers around a single point of failure.
Log compaction as orchestration. Kafka’s log compaction runs on a background thread, operates on a separate copy of the segment, swaps the new segment in atomically once written, and rolls back on failure rather than leaving a half-compacted segment in place. The conversation analogue is exact: run the summarizer on a separate model call, write to a staging slot, validate, then atomically replace the buffer span. A naive implementation that streams the summarizer’s output into the live buffer and then prunes leaves the system in a broken state if the summarizer truncates, errors mid-stream, or returns malformed JSON.
Generational garbage collection is the closest match for when compaction fires. The JVM’s G1 collector runs young-generation GC frequently (fast) and full GC rarely (stops the world). The same shape applies here: micro-compaction (drop redundant tool results, collapse repeated reads of the same file) runs often and is cheap; full compaction (summarize the middle of the buffer) runs rarely and is expensive. Claude Code’s harness ships exactly this distinction — micro-compact runs at ~60-70% utilization and selectively clears tool outputs; full auto-compact runs at ~95%. A harness with only the full-GC equivalent runs the expensive pass too often; one with only the young-gen equivalent runs out of context before the full pass fires.
Circuit breakers around a single point of failure. Long-running databases wrap their checkpointing thread in a watchdog: if the checkpoint hangs, the watchdog kills it and triggers a fallback. Conversation compaction needs the same discipline because — as the opening anecdote showed — a compaction pass that fails repeatedly takes the whole session down with it. The circuit-breaker pattern from the tool-use article ports over: count failures, trip after N, fall back to a degraded strategy. A 2026.3.2 bug report against a major coding-agent product describes exactly this failure — compaction timeouts deadlocked the session because there was no breaker; the user couldn’t even run /new because queued commands sat behind the timed-out compaction. The fix wasn’t to make compaction faster; the fix was to add the breaker that should have been there from day one.
Reactive vs preemptive triggers
Two philosophies, and most production harnesses end up running both.
Reactive compaction fires after the buffer crosses a watermark. Anthropic’s compaction beta is the cleanest example — the server detects when input tokens exceed the configured trigger (default 150,000), runs compaction inline, emits a compaction block, and continues. OpenAI’s server-side compaction via the Responses API (shipped February 11, 2026) is conceptually identical: pass context_management.compact_threshold and the server fires compaction when the rendered token count crosses it. Claude Code’s auto-compact at ~95% is the client-side version. The pros are operational simplicity: a single number, a single conditional. The cons: the user-facing turn that crosses the watermark pays 1-5s of compaction latency, and there’s no head-room — if the model emits a 50K-token tool result at 90K of a 100K budget, the compaction must succeed on the first try because there’s no room for retry tokens.
Preemptive compaction fires before the buffer is dangerous, by projecting whether the next turn will breach the budget after the model’s reply. The estimate is the current buffer size plus a conservative max_tokens bound on output plus the expected tool-result size from the most recent tool_use block. The pros: the user-facing turn never gets surprise latency, and there is always head-room for the model to reply. The cons: some compactions are wasted (the prediction was conservative), and the trigger requires a reasonably accurate token estimator plus policy about max output size, which often lives outside the core compaction module.
The defensible production pattern is preemptive as primary, reactive as fallback. Run preemptive at ~70% of effective context; keep reactive as a backstop at ~95% for cases where the preemptive estimator was wrong (which it will be — tool-result sizes are heavy-tailed). This is the generational-GC pattern ported to the trigger surface.
Cache-aware compaction: surgical deletion
The most important property of a production compactor is cache-awareness. Every buffer rewrite invalidates the prompt-cache prefix from the rewrite boundary onward, so a naive compactor that summarizes the entire history into a fresh block pays full prefill on every subsequent turn — roughly 10× the cost (cache reads bill at 10% of base on Anthropic and similar on OpenAI) and seconds of added latency. Cache-aware compaction is the load-bearing optimization that turns long sessions from expensive to feasible.
Three patterns ship in production:
Surgical tool-result deletion, system prompt and assistant turns preserved. The cheapest operation: identify tool-call/tool-result pairs where the result has since been superseded (e.g., five Read calls on the same file — keep the latest, replace the earlier ones with [result superseded; see turn N]). The system prompt, assistant reasoning, and the recent tail are unchanged; the cacheable prefix grows monotonically. This is what Anthropic’s clear_tool_uses_20250919 does at the API level and what Claude Code’s micro-compact does client-side. Compression ratio is modest (10-30%), but the cache hit rate stays high, and the operation can run frequently.
Append-only summary block, untouched tail. When micro-compaction isn’t enough: write a new “session summary” block at a stable position (right after the system prompt), and drop the messages it summarizes. The summary, once written, is treated as immutable for the next K turns — it doesn’t get re-summarized on every turn; the cache treats it as a stable prefix. The conversation tail after the summary keeps growing and benefits from incremental caching. The Anthropic beta’s pause_after_compaction option exists for this pattern: pause after the summary is generated, let the client preserve any instruction-oriented messages, then continue. Compression ratio is high (80-95%); cache invalidation cost is paid once per K turns rather than every turn.
Anti-pattern: summary-as-system-prompt mutation. Some harnesses fold the running summary into the system prompt, mutating it every cycle. This is the worst thing you can do for the cache — the system prompt is the most-cached prefix, and rewriting it every turn recomputes the entire prefix on every turn. The cost graph reads “we turned on caching” but the bill stays flat. The harness anatomy article flagged this; it bears repeating because every team seems to invent it independently.
Rule of thumb: compress as far from the cache-hit prefix as possible; rewrite as little of the prefix as possible; treat the summary block as an immutable record between compactions, not a running state.
Error recovery: circuit breakers and snapshot-rollback
Compaction can fail in five ways: the summarizer errors out (network, rate limit, 5xx); returns malformed JSON; returns valid JSON but with empty load-bearing fields (the worst silent failure); its output exceeds its own context window (the wedged-session bug from the opening); or the summarized buffer still exceeds the foreground model’s limit. Each needs a typed recovery path.
Snapshot-and-rollback for atomic compaction. Before mutating the live buffer, take a snapshot — a deep copy of the message array and the running summary. Run the summarizer against the snapshot, validate the output (JSON parse + schema check + load-bearing-field presence), and only on success atomically swap the new buffer in. On any failure, roll back to the snapshot. The snapshot is the same primitive long-horizon checkpointing uses, applied at a finer grain. The cost is small; the safety benefit is large — compaction never leaves the session half-rewritten.
Circuit breaker around compaction failures. Maintain a per-session counter of consecutive failures. Trip at N=3 (a defensible default; lower if throughput-bounded, higher if the summarizer is flaky). When tripped, do not retry compaction — fall back to a degraded strategy and emit a structured failure. The breaker’s purpose is to break the infinite-failure loop: a session that fails once is recoverable; a session that fails repeatedly while burning the full timeout window each time is a service outage. After M turns of cool-down, reset.
Lossy truncation as the last-resort fallback. When the breaker is tripped or the summarizer is unreachable, fall back to unconditional head-tail truncation: keep the system prompt, keep the most recent K turns, drop everything in the middle. This is the default eviction policy from short-term memory, used here as the fallback when the smarter policy fails. The agent loses semantic context but the session stays alive — and a degraded session is recoverable; a crashed one is not. The fallback should be loud: log the event, expose a UI indicator, mark the failure for downstream review.
The wedged-buffer escape hatch. For the case where the summarizer itself can’t fit the buffer in its context: either keep a secondary summarizer with a larger context window, or run a chunked-and-merge pass that splits the buffer in half, summarizes each half independently, then summarizes the summaries. The chunked-and-merge pattern is the same shape as map-reduce summarization and is the right recovery primitive when the foreground model’s context exceeds the summarizer’s.
Append-only memory journals: the architectural alternative
A radically different design ports the database log-vs-LSM-tree decision: instead of compacting the live buffer, run the live buffer at near-zero retention and write everything important to an external append-only memory journal, queried at retrieval time rather than loaded wholesale.
Mechanically: every turn, before the model call, the harness extracts decisions, files, errors, or load-bearing facts and appends them to a per-session journal (JSONL file, Postgres table, vector store — substrate-agnostic). The live buffer stays short — last 10-20 turns — by aggressive truncation. When the model needs earlier context, it queries the journal via a recall tool the harness exposes. The journal grows linearly; the live buffer is bounded by design.
The journal model has three properties worth naming. It replaces the orchestration-failure surface with a retrieval-quality surface: no compaction to fail, but a recall query whose quality determines whether the agent finds the relevant fact. It makes the trade-off explicit: the operator can inspect, query, and audit the journal — versus a summary block whose contents are at the summarizer’s discretion. And it plays well with sleep-time compute: the journal is exactly the artifact an offline consolidation pass needs.
The pattern shows up under several names in 2026 production systems. Doug Turnbull’s “give your coding agent a journal” is the cleanest articulation — an agent maintains a journal file in the working directory, one entry per significant action, queried when it needs to remember. OpenCode’s append-only journal blocks productize the idea with semantic search. LangGraph’s checkpoint-and-store split implements the architectural distinction at the framework level. The unifying claim: aggressive truncation plus a queryable external journal beats summarization-in-place for any task where audit and reproducibility matter more than the gist surviving in prose.
The journal wins for coding agents (audit trail needs to survive verbatim), compliance workflows (regulators want exact actions, not paraphrases), long-horizon research (the journal is the research log), and multi-session agents (journals are naturally cross-session). Compaction still wins for open-ended conversational agents where the gist is what matters, latency-sensitive interactive agents where the journal round-trip dominates, and single-session ephemeral workflows. The patterns are complementary; the production answer for most agents is both — a journal for durable state, compaction for the live buffer’s working set.
A preemptive, cache-aware compactor in Python
Realizes the full orchestration shape: preemptive triggering, snapshot-and-rollback, circuit-breaker recovery, cache-aware mutation. Uses the Anthropic SDK.
| |
The interesting parts aren’t the summarizer call — that’s the context-compression article’s domain — but the orchestration. The snapshot-and-rollback cannot leave the buffer half-rewritten. The breaker prevents the infinite-failure loop. The lossy-truncation fallback keeps the session moving. The chunked-and-merge escape hatch handles the wedged case. The preemptive trigger gates on projected, not on current buffer size, so head-room is reserved before the call.
Same shape in TypeScript with the Vercel AI SDK
The Vercel AI SDK’s prepareStep is the documented hook for conversation compaction in AI SDK 5; it runs before each model call and can rewrite the messages array. The orchestration shape ports over without modification.
| |
The orchestration is provider-agnostic: the trigger lives in the harness, the summarizer is one model call like any other, the breaker and snapshot are pure state machines. Only the SDK integration differs — prepareStep in the Vercel AI SDK, manual buffer reconstruction in the raw Anthropic SDK.
Trade-offs, failure modes, and gotchas
Don’t compact during a tool-use cycle. A rewrite between a tool_use block and its matching tool_result leaves the API in an inconsistent state — modern providers reject the malformed sequence. Compaction fires between completed turns, never mid-turn; if the trigger fires while a result is pending, defer. Same tool-call/tool-result pairing invariant the short-term memory article named for truncation.
Tokenizer drift (the opening bug). Token counters lag the production tokenizer. If the harness ships its own, add a 5-10% margin to the stated context limit. Cheap defense: use the provider’s tokenizer API. Expensive defense: count server-side via usage blocks. Never trust a third-party tiktoken or claude-tokenizer library exactly — production tokenizers update.
Summarizer-context smaller than foreground. Common config mistake: foreground is Opus with 1M-token context, summarizer is Haiku with 200K. When the buffer crosses 200K, the summarizer fails because its own context is exhausted. Fix: pick a summarizer with at least the foreground model’s context, or ship the chunked-and-merge fallback as first-class. The OpenAI Agents SDK’s OpenAIResponsesCompactionSession sidesteps this by compacting server-side; client-side compactors don’t get that luxury.
The re-reading loop. The summary-paraphrasing bug, named in depth in the context-compression article, shows up here as the agent calling Read on a file it already edited because the summary paraphrased the path. Beyond the schema-enforced verbatim-identifier rule, the orchestration defense is to retain a parallel structured-index store (last file modified, last error seen, last decision) outside the compacted buffer and re-inject the relevant entries as pinned context. The journal pattern provides this naturally; the compaction-only pattern adds it explicitly.
Repeated compaction on every turn. If the soft watermark fires but compaction only achieves a small ratio, the next turn still projects over the watermark and compaction fires again. The session burns calls on back-to-back compactions. Fix: set a post-compaction target (compress until the buffer is ≤50% of the soft watermark) rather than running compaction once on crossing. Compress to the floor, not just below the ceiling.
Compaction during streaming. If the watermark is crossed mid-stream, do not interrupt to compact. Let the stream complete, then compact before the next turn. Interrupting a stream to mutate the buffer is the same bug family as compacting mid-tool-use.
Goodhart on trigger rate. Optimizing for “compactions fired per session” without measuring failed-compaction cost tempts the operator to lower the breaker threshold and fall back to truncation aggressively. Compactions-fired drops; session quality silently degrades because more sessions run on truncated context. Track post-compaction probe recovery jointly — does the agent still answer “what file did we edit?” correctly.
Anthropic’s compaction beta does most of this for you. When you can use compact_20260112, the server-managed checkpoint eliminates client-side boundary risk, the breaker is implicit, and the cache-aware rewrite is handled in the API. The client-side patterns in this article are for provider portability, proxies that don’t expose the beta, or trigger logic more sophisticated than threshold-crossing. Use the server-side primitive when you can; build the harness primitive when you can’t.
What to read next
- Summarization and Context Compression — the sibling piece. This article handles when and how to fire compaction in a long session; the compression article handles what the compaction operation actually does — recursive summarization, structured note-taking, verbatim compaction, opaque compression, and the quality-loss diagnostics.
- Anatomy of an Agent Harness — the runtime layer the compactor lives inside. Duty 5 (error recovery) and duty 4 (cache management) are the two duties this article specializes; the harness anatomy article is the integration view.
- Prompt Caching: Reusing the KV Cache Across Calls — the cost lever the cache-aware-compaction pattern is built around. Compaction is one of the few operations that can destroy a cache hit rate overnight; the prompt-caching article is the upstream context for why the cache-awareness discipline pays off.
- Long-Horizon Task Reliability — the broader recovery framework. The snapshot-and-rollback pattern, the circuit breaker, and the degraded-fallback discipline are all expressions of the same long-horizon-reliability primitives at a finer grain.
Further reading
- Anthropic — Compaction documentation — the reference for the
compact_20260112server-managed strategy. Thepause_after_compactionoption and the custom-instructions surface are the production levers worth knowing before you build a client-side compactor. - OpenAI — Compaction guide for the Responses API — the parallel server-side primitive (
context_management.compact_threshold), shipped February 11, 2026. The shape is conceptually identical to Anthropic’s beta; the trigger semantics differ slightly (token-count threshold vs. trigger-with-strategy). - Doug Turnbull — “Give your coding agent a journal” — the cleanest articulation of the append-only-journal alternative. The “journal-file-in-the-working-directory” pattern is the operational shape every coding-agent team eventually converges on, and the post explains why.
- Morph LLM — “Claude Code auto-compact: what triggers it, what it loses, how to fix it” — the production-incident perspective on auto-compaction. The empirical observations on micro-compact vs full-compact thresholds and the cataloged loss modes are the closest thing to a public post-mortem on a widely-deployed compactor.
- Phil Schmid — “The importance of Agent Harness in 2026” — the broader framing of the harness as the new competitive surface. The “build-to-delete” argument and the trajectory-data point are load-bearing background for any team deciding whether to invest in a custom compactor.