$ cat ai-engineering/memory-write-policies.md

Memory Write Policies: What's Worth Remembering

Memory write policies: distillation, write amplification, the journaling-vs-checkpoint trade-off, learned classifiers, and admission control for agents.

Jatin Bansal@blog:~/ai-engineering$ open memory-write-policies

A coding agent is six months into a long-running engagement with one user. The long-term episodic store holds 47,000 entries — every user turn, every tool result, every assistant reply, faithfully appended on every call. The agent feels worse than it did at month two. Retrieval pulls textually similar episodes that are no longer relevant; the recency × importance × similarity rerank is sorting through a haystack that is now 95% noise; the hierarchical memory hot tier gets pre-paged with stale facts because the warm tier’s top-K is dominated by old write-spam. None of the storage layers are broken. The substrate works. The substrate is too good at storing everything, and the team never built a defensible answer to the upstream question: which writes should ever have entered the store in the first place? That question is what memory write policies answer, and it is the layer that decides whether every other piece of the memory stack pays off. This article is the deep dive.

Opening bridge

The long-term memory piece sketched three write policies in passing — a heuristic gate, an LLM-classified gate, and a deferred session-end summarization — and flagged that the Mem0 paper attributes most of its measured recall lift to aggressive write-time filtering. The knowledge-graphs piece added a different write concern: entity resolution at write time as a structural problem distinct from filtering noise. The hierarchical-memory piece noted that “demotion is never deletion” — the demotion path needs the same care as the promotion path. Today’s article steps back from any single substrate and treats the write path as a system-level concern: an admission-control layer that sits in front of every memory tier, decides what gets in, distills it before it lands, and reserves the cost of expensive normalization (entity resolution, conflict checks, embedding) for content that has cleared the gate. This is the layer that, in production, either makes the rest of the stack feel like memory or quietly turns it into a write-only landfill.

Definition

A memory write policy is the set of rules that decides, for each candidate piece of information the agent could remember, (a) whether to write at all, (b) what shape to write — raw episode, distilled fact, or both — (c) which tier to write to, (d) when to write — synchronously on the hot path or deferred — and (e) what to do on conflict with existing content. Five orthogonal decisions, every one of which has a defensible default and a more thoughtful tuned answer. Three properties separate a policy from a handler. First, it is applied uniformly — every write goes through the same gate, with the same evaluation, so the corpus has predictable statistical properties. Second, it is bounded in cost — the policy itself has a budget (latency, dollars, model calls per write) that does not scale with the size of the existing store. Third, it is auditable — the agent (and the operator) can answer “why is this in the store?” and “why is this not in the store?” from explicit policy outputs, not from a model’s hidden reasoning.
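One way to make the five decisions and the auditability property concrete is to have the gate emit a small record for every candidate, admitted or not. The sketch below is illustrative only: `WriteDecision` and the enum names are hypothetical, not taken from any framework discussed here.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class Shape(str, Enum):
    RAW = "raw"
    FACT = "fact"
    BOTH = "both"


class Tier(str, Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


class Timing(str, Enum):
    SYNC = "sync"
    DEFERRED = "deferred"


class OnConflict(str, Enum):
    KEEP_BOTH = "keep_both"
    SUPERSEDE = "supersede"
    FLAG = "flag"


@dataclass(frozen=True)
class WriteDecision:
    admit: bool                         # (a) write at all?
    shape: Optional[Shape]              # (b) raw episode, distilled fact, or both
    tier: Optional[Tier]                # (c) which tier
    timing: Optional[Timing]            # (d) hot path or deferred
    on_conflict: Optional[OnConflict]   # (e) what to do on conflict
    reason: str                         # explicit output that answers "why?"

    def audit_line(self) -> str:
        """Serialize the decision so rejections are as inspectable as admissions."""
        return json.dumps(asdict(self), sort_keys=True)


rejected = WriteDecision(False, None, None, None, None, "triage: filler turn")
admitted = WriteDecision(True, Shape.BOTH, Tier.WARM, Timing.DEFERRED,
                         OnConflict.KEEP_BOTH, "explicit user preference")
print(rejected.audit_line())
print(admitted.audit_line())
```

Logging the rejected record alongside the admitted one is what lets an operator answer "why is this not in the store?" without replaying the model.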

What a write policy is not. It is not the chunking decision (chunking sets the granularity of stored units; the write policy decides whether a unit enters at all). It is not the read-time rerank from the long-term memory piece (the rerank can filter noise if it is already in the corpus; the write policy prevents the noise from entering). It is not the maintenance pass — reflection, sleep-time consolidation, conflict resolution — that runs over the store later (those operate on what was written; the policy decides what is written). The four together form the memory subsystem’s full lifecycle; the write policy is the front door.

Intuition

The mental model that pays off is admission control on a write-ahead log with bounded query budget. The episodic store is the log; the queries are read-side recall passes; the budget is the model’s effective working set in the prompt. The log itself is cheap to grow — storage is virtually free — but every append degrades the signal density of subsequent reads, because the read path retrieves a fixed top-K from a growing population. Unbounded admission is the slow-poison version of write amplification: each individual write costs nothing, the harm is invisible at write time, the cumulative harm is fatal to the read path months later.

The clearer frame: every write is a tax on every future read. A naive policy treats the write as the responsibility (“I observed something; should I store it?”) and asks the question in isolation. The defensible policy treats the read as the responsibility: “I will run N queries against this store over its lifetime; does the marginal value of this entry across those N reads exceed the marginal precision loss it imposes on the others?” Both questions are LLM-decidable; only the second produces a policy that survives the corpus crossing 10k entries.

Two concrete design questions force themselves on every implementation. The first: what unit are we deciding on? A raw turn, a user/assistant exchange, a session, or an extracted fact lifted from one of those? The unit-of-decision question is upstream of every other policy choice, and it is the same question the long-term memory piece named the unit-of-recall problem; the write policy and the read granularity have to agree, or the gate is evaluating one thing and the retrieval is searching another. The episode segmentation piece is the deep dive on actually picking that unit at write time: the five segmentation signals, the cognitive-neuroscience grounding, and the event-sourcing aggregate-boundary parallel that tells you when the segment is the right size. The second: when does the policy run? Synchronously inside the agent’s turn (admission control), deferred to session-end (batch admission), or in a background pass that processes the raw episodic log into a distilled tier (asynchronous admission)? Each of the three has a different cost profile and a different fidelity ceiling.

The distributed-systems parallel

Three honest parallels, each load-bearing.

The write classifier is admission control on a queue. Production message brokers have spent decades fighting the same shape of problem: producers can flood the system; downstream consumers have bounded throughput; the only correct answer is to apply admission control at ingress, drop or shape work before it enters the pipeline, and never let the producer’s rate set the consumer’s load. Memory writes are the same shape: the agent’s per-turn output is the producer; the read path’s top-K bandwidth is the consumer; an unbounded admission policy moves the failure from “the producer hits a backpressure signal” to “the consumer silently degrades over time.” The mature admission policies — heuristic prefilter, classifier-gated admission, token-bucket rate limit on writes per user per session — all port directly. The agent-budgets-and-runaway-prevention article covers the parallel runaway-prevention story for the loop itself; the write policy is the budget for the memory subsystem specifically.
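The token-bucket rate limit mentioned above can be sketched in a few lines. This is a minimal, in-process version under stated assumptions: the class name `WriteRateLimiter`, the per-(user, session) keying, and the default capacity and refill rate are all illustrative choices, not values from any framework.

```python
import time
from typing import Dict, Optional, Tuple


class WriteRateLimiter:
    """Token bucket per (user, session): a burst of `capacity` writes,
    then a steady `refill_per_sec` writes per second."""

    def __init__(self, capacity: int = 5, refill_per_sec: float = 0.1) -> None:
        self.capacity = float(capacity)
        self.refill_per_sec = refill_per_sec
        # (user, session) -> (tokens remaining, timestamp of last update)
        self._buckets: Dict[Tuple[str, str], Tuple[float, float]] = {}

    def allow(self, user: str, session: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get((user, session), (self.capacity, now))
        # Refill for the elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        admitted = tokens >= 1.0
        if admitted:
            tokens -= 1.0
        self._buckets[(user, session)] = (tokens, now)
        return admitted


limiter = WriteRateLimiter(capacity=2, refill_per_sec=1.0)
burst = [limiter.allow("u1", "s1", now=0.0) for _ in range(3)]
print(burst)  # the third write in the burst is rejected
```

A rejected write does not have to be dropped; in a journal-and-checkpoint design it can stay in the journal and wait for a background pass, which is the memory-side equivalent of shaping rather than shedding load.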

Journaling versus checkpoint is the write-shape question. Databases ship two complementary write paths: the WAL (write-ahead log) appends every state-changing operation verbatim for durability and replay, and the checkpoint flushes a coalesced snapshot of state to long-term storage at intervals. Postgres, MySQL/InnoDB, RocksDB — all of them do both, and the trade-off between them is redo time vs read cost. WAL-only systems can replay arbitrary history but pay every read against the journal’s growing tail; checkpoint-only systems read fast but lose audit history. Production database engines run both because each pays for itself in a different read pattern. Agent memory recapitulates this exactly: the episodic store is the journal (every meaningful turn, append-only, replayable), and the semantic/working-memory-extracted store is the checkpoint (distilled facts, deduped, indexed for fast read). A write policy that only journals (Letta’s recall tier, raw vector store) reads slowly and remembers everything; a write policy that only checkpoints (Mem0’s distilled-fact pipeline as the only tier) reads fast and loses the audit trail. The mature 2026 production frameworks run both — journal the raw turn, checkpoint the extracted fact, and let the read path choose which tier to hit. This is the journal/checkpoint hybrid from databases applied to agent memory; the distributed-systems-engineer instinct that “you want both” ports over without modification.

Write amplification is the cost-model frame. SSDs surface the same problem at the hardware layer: the cost of a logical write is not one physical write — it’s one logical write plus the garbage-collection and wear-leveling overhead that the device has to perform to keep the storage usable. A write that triggers 10x physical writes is a write-amplification factor of 10, and SSD firmware spends enormous effort to keep that factor low. Memory writes have an exact analogue: the cost of writing one episode is not one row insert — it is the row insert plus the embedding call (typically 0.5-2ms and a fraction of a cent), plus the entity-resolution pass (an LLM call in many frameworks), plus the dedup-against-existing check (another retrieval round-trip), plus the cache invalidation on any pre-paged warm-tier entry, plus the downstream cost of having one more row competing for top-K slots on every future read. A naive write policy treats only the row insert as the cost; the defensible policy budgets for the entire amplification chain and decides whether the marginal value of the entry justifies it. Mem0’s 2026 single-pass-ADD-only redesign, which cut write-time LLM calls 60-70% by deferring conflict resolution to retrieval time and running entity linking asynchronously, is exactly an amplification-reduction story — the same SSD-firmware optimization vocabulary, applied to a memory subsystem.
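To make the amplification chain something you can budget against, it helps to write it down as a cost record. The sketch below is a toy accounting model: the field names mirror the chain described above, but every number is a made-up placeholder in arbitrary units relative to one row insert, and the read-side cost of one more row competing for top-K is deliberately left out.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WriteCost:
    # All values are in hypothetical units where one row insert costs 1.0.
    row_insert: float = 1.0
    embedding: float = 0.0
    entity_resolution: float = 0.0
    dedup_lookup: float = 0.0
    cache_invalidation: float = 0.0

    def total(self) -> float:
        return sum(getattr(self, f.name) for f in fields(self))

    def amplification(self) -> float:
        """Total work triggered per logical write, relative to the bare insert."""
        return self.total() / self.row_insert


def worth_writing(expected_value: float, cost: WriteCost) -> bool:
    """Admit only if the entry's estimated value (same hypothetical units)
    exceeds the full write-side chain, not just the row insert."""
    return expected_value > cost.total()


# Placeholder numbers: an eager pipeline that resolves entities and dedupes
# synchronously, versus one that defers both to a background pass.
eager = WriteCost(embedding=4.0, entity_resolution=40.0,
                  dedup_lookup=5.0, cache_invalidation=1.0)
deferred = WriteCost(embedding=4.0)
print(eager.amplification(), deferred.amplification())  # 51.0 5.0
```

The point of the exercise is the comparison, not the numbers: an entry that clears the bar under the deferred pipeline may not clear it under the eager one.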

The four-stage write pipeline

The pattern that has converged across Mem0, LangMem, Letta, and A-MEM is a four-stage pipeline. Knowing the stages by name is what makes the design decisions in each framework legible:

Stage 1 — Triage. Given a candidate turn or exchange, decide whether it’s worth processing at all. The cheapest possible filter: skip system messages, skip empty assistant turns, skip pure conversational fillers (“Got it”, “Thanks”, “Can you repeat that?”). No model call; a regex or token-count heuristic does the work. The triage stage is what keeps the write pipeline from running expensive downstream stages over the high-volume floor of pure noise. Skipping triage and going straight to LLM extraction is the most common production cost bug — you’ll spend $5 a day per active user on extraction calls for “ok” turns.

Stage 2 — Extract / distill. Given a turn that cleared triage, transform it into the shape you actually want to store. Three common shapes: keep the raw episode (journal-only), extract one or more structured facts ({type, subject, value, confidence}, checkpoint-only), or both (hybrid). The extract stage is typically a small-model call (Mem0’s default is GPT-5-mini, Letta’s classifier path uses a smaller model, most hand-rolled systems use Haiku or Llama-3-8B-Instruct), and the prompt anchors the schema and the confidence scale. The fact-shaped output is itself the unit of admission downstream — a single turn might extract zero facts (if it was pure conversation) or three (a preference, an event, a corrective). The distill step is what separates “memory” from “transcript.”

Stage 3 — Dedupe / resolve. Given an extracted fact, check whether it duplicates or contradicts something already in the store. Dedup against existing facts is typically a hybrid lookup: exact-match on a stable identifier (e.g., a normalized entity name) plus embedding similarity above a threshold. Conflict detection is harder — the new fact might update an old one (“user moved from SF to NYC”), supersede it (“user no longer works at Acme”), or contradict it (likely a hallucination, should be flagged for review). The 2026 production answer has converged: don’t try to resolve conflicts at write time in the synchronous path; write the new fact, mark the old one with a back-reference, and let the read-time rerank or a background reconciliation pass surface the conflict. The full ADD/UPDATE/DELETE/NOOP resolver, the user-vs-system arbitration policy, and the contradiction-density curve all get the dedicated treatment in the memory conflict and forgetting article. A-MEM’s Zettelkasten-inspired linking extends this idea further — every new memory dynamically updates the contextual attributes of related existing memories at write time, building an explicit knowledge graph as a side effect of the write path.
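The "write the new fact, mark the old one with a back-reference" rule can be sketched with an in-memory store. This is an illustration under assumptions: `FactStore`, `StoredFact`, and the `superseded_by` field are hypothetical names, and the sketch makes no attempt to tell an update from a contradiction, which is left to the read path or a background pass as described above.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoredFact:
    id: int
    subject: str
    predicate: str
    object: str
    superseded_by: Optional[int] = None  # back-reference to the newer fact


class FactStore:
    def __init__(self) -> None:
        self.facts: List[StoredFact] = []
        self._ids = itertools.count(1)

    def add(self, subject: str, predicate: str, obj: str) -> StoredFact:
        key = (subject.lower(), predicate.lower())
        current = [f for f in self.facts
                   if (f.subject.lower(), f.predicate.lower()) == key
                   and f.superseded_by is None]
        # Exact duplicate of a live fact: nothing to write.
        for f in current:
            if f.object.lower() == obj.lower():
                return f
        # Otherwise write the new fact and point the old ones at it.
        # Nothing is deleted and no conflict is resolved here.
        new = StoredFact(next(self._ids), subject, predicate, obj)
        self.facts.append(new)
        for f in current:
            f.superseded_by = new.id
        return new


store = FactStore()
old = store.add("user", "lives_in", "SF")
new = store.add("user", "lives_in", "NYC")
print(old.superseded_by == new.id, len(store.facts))
```

Because the old row survives with a pointer to its successor, a later reconciliation job can still decide whether the change was an update, a supersession, or a hallucinated contradiction.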

Stage 4 — Persist / index. Given an admitted fact, embed it, write it to the storage tier, and update any auxiliary indices (entity graph, by-time index, by-user index). This is the cheapest physical operation in the pipeline; it dominates only if stages 1-3 have already filtered hard. Persist is also where the tier decision lands: a high-confidence durable preference goes to the hot/core tier, a moderate-confidence fact goes to the warm/recall tier, a low-confidence raw episode goes to the cold/archival tier.
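The tier decision at persist time is small enough to write as a pure function. A minimal sketch, assuming the fact dict carries `type` and `confidence` fields; the function name `route_tier` and both thresholds are illustrative defaults, not recommendations.

```python
def route_tier(fact: dict, hot_threshold: float = 0.9,
               warm_threshold: float = 0.5) -> str:
    """Map an admitted fact to a storage tier.

    Assumption: only preferences count as "durable" enough for the hot tier.
    """
    confidence = float(fact.get("confidence", 0.0))
    durable = fact.get("type") == "preference"
    if durable and confidence >= hot_threshold:
        return "hot"    # core tier
    if confidence >= warm_threshold:
        return "warm"   # recall tier
    return "cold"       # archival tier


print(route_tier({"type": "preference", "confidence": 0.95}))
print(route_tier({"type": "event", "confidence": 0.6}))
print(route_tier({"type": "fact", "confidence": 0.2}))
```

Keeping the routing rule outside the storage code means the thresholds can be tuned, and audited, without touching the persist path.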

The shape of a defensible memory subsystem is all four stages run on every write, with the cost of each stage bounded and the failure of each stage handled explicitly. Skipping stage 1 floods stages 2-4 with garbage; skipping stage 2 turns the store into a transcript dump; skipping stage 3 lets contradictions accumulate; skipping stage 4 means the fact never makes it into a queryable tier. Production failures usually trace to a missing stage rather than to a bad stage.

When the policy runs: hot-path, deferred, or background

The single biggest cost lever in a memory write policy is when it runs relative to the user-facing turn.

Hot-path admission (synchronous). Every turn ends with the full write pipeline running synchronously before the agent returns control to the user. Pros: the next turn (in this session or any other) sees the new memory; no risk of dropping memories if the process crashes between the user turn and a deferred pass. Cons: the user-facing latency includes the full extract-dedupe-persist cost; on a turn that didn’t need a memory, you’ve paid for stages 1-2 anyway; the per-turn LLM call for extraction roughly doubles the agent’s per-turn cost. Best fit: short interactive conversations where same-session continuity matters and the user tolerates a 200-500ms latency tax.

Deferred at session-end (batched). The agent buffers candidate writes in working memory through the session; when the session closes (explicit logout, idle timeout, or natural turn-of-topic) a single batched write pass processes the entire session. Pros: amortizes the extraction cost across many turns (one call instead of N); produces higher-quality extracted facts because the model sees full context; no per-turn latency tax. Cons: in-session continuity has to come from the working-memory scratchpad or the conversation buffer, not from long-term memory; if the session ends abruptly (process crash, network drop) all of it is lost; the user can’t reference a fact from earlier in the session via memory retrieval — only via the in-context conversation log. Best fit: customer-support sessions, bounded interactions with a defined end, anywhere the value of cross-session continuity dominates the value of within-session continuity.

Background / sleep-time (asynchronous). A separate process consumes a raw journal (every turn, written unfiltered) and runs the extract-dedupe-persist pipeline in the background, on a queue, often using a different model than the foreground agent. LangMem’s background memory manager is the productized version of this; Letta’s sleep-time agents are a richer variant that also consolidates the existing store. Pros: zero impact on user-facing latency; full session context available to the extractor; can use larger, slower, more accurate models than the foreground agent can afford; the journal serves as a durable audit log even if the background pipeline fails. Cons: there is a window where new memories aren’t yet queryable (the consistency lag from journal to indexed tier); operationally more complex — you now have two pipelines, a queue, and a failure mode where the journal grows faster than the extractor drains it; and the journal itself is unbounded by default, which puts the write-amplification problem back at the storage layer if you don’t add log compaction. Best fit: high-throughput multi-tenant systems where per-turn latency is critical and the operator has the engineering budget to run and monitor a background pipeline.

The defensible production answer is a mix: a fast triage on the hot path (sub-millisecond, no model call), a deferred extract pass at session-end for the common case, and a background reconciliation pass that runs nightly to dedupe, consolidate, and clean up. The shape mirrors what databases do: the WAL writes synchronously for durability, the checkpoint flushes asynchronously for read performance, and the autovacuum runs in the background to clean up bloat. The same three-tier pattern, applied one layer up.

Distillation: from turn to fact

The extract stage is where most of the policy’s intelligence lives, and it has converged on a remarkably consistent prompt shape across frameworks. The pattern, distilled from the Mem0 fact-extraction prompt and LangMem’s memory manager:

Anchor the schema. Tell the model exactly the structured shape you want: {type: "preference"|"fact"|"event"|"correction", subject: string, predicate: string, object: string, confidence: 0-1, valid_from: timestamp?}. The schema is what makes the extracted facts dedupable downstream — without consistent shape, the dedup stage can’t compare apples to apples. The valid_from field is the upstream half of the bi-temporal model that the temporal-reasoning article works on the read side; writing it consistently here is what lets the read path answer “as of when?” queries later.
Anchor the confidence scale. “0.2 = a guess from one ambiguous turn, 0.5 = an inference from a single clear statement, 0.9 = the user explicitly stated it, 1.0 = a system-of-record import.” Without anchoring, confidence scores cluster at 0.7 ± 0.1 and become useless for downstream filtering.
Anchor what not to extract. “Do not extract: small talk, the assistant’s own opinions, content that depends on session-specific context, hypothetical or counterfactual statements.” This is the negative-example list that prevents the extract stage from over-producing — without it, the model will extract a “preference” from “I think pizza could be good for dinner tonight” and clutter the store.
Return JSON, return an array. Always return a list, even if zero or one. The extractor returns zero facts as often as it returns one — zero-fact turns are the common case for any conversational system, and the policy has to handle them as a normal output, not an error.
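The four anchors above only pay off if the extractor's output is checked before it moves downstream. Here is a minimal validation sketch, assuming the model returns a JSON array of objects in the schema described; `parse_facts` and the constant names are hypothetical, and the optional `valid_from` field is passed through unchecked.

```python
import json
from typing import List

ALLOWED_TYPES = {"preference", "fact", "event", "correction"}
REQUIRED_STR_FIELDS = ("subject", "predicate", "object")


def parse_facts(raw: str) -> List[dict]:
    """Parse extractor output; return only well-formed facts.
    An empty list is a normal result, not an error."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    valid = []
    for item in data:
        if not isinstance(item, dict):
            continue
        fact_type = item.get("type")
        if not isinstance(fact_type, str) or fact_type not in ALLOWED_TYPES:
            continue
        if not all(isinstance(item.get(k), str) and item[k].strip()
                   for k in REQUIRED_STR_FIELDS):
            continue
        conf = item.get("confidence")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            continue
        if not 0.0 <= conf <= 1.0:
            continue
        valid.append(item)
    return valid


sample = ('[{"type": "preference", "subject": "user", "predicate": "prefers", '
          '"object": "tabs", "confidence": 0.9}, {"type": "opinion"}]')
print(len(parse_facts(sample)))  # one well-formed fact survives
```

Dropping a malformed item silently is a design choice; a production gate would more likely count and log the rejections so that a drifting prompt shows up on a dashboard.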

The single-pass ADD-only redesign Mem0 shipped in 2026 is worth knowing as a specific instance of this pattern. The old pipeline ran two LLM passes: the first extracted candidate facts, the second compared each candidate against the existing store and emitted an ADD / UPDATE / DELETE / NOOP decision. The new pipeline runs one LLM pass that emits ADD-only writes and defers all UPDATE/DELETE reconciliation to either retrieval-time scoring or an asynchronous consolidation job. Reported result: 60-70% fewer write-time LLM calls, with the read path absorbing the conflict-resolution work via the recency-weighted rerank from the Generative Agents paper, in which newer ADDs naturally outscore older ones for the same fact. The trade-off is that the store now contains both versions of any fact-with-history, and the dedup work happens at read time rather than write time. The redesign is a classic case of moving the cost off the hot path: the read path’s bounded top-K naturally drops the older versions, and the cheaper writes pay for themselves immediately.
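What "the read path absorbs the conflict" looks like can be sketched with a toy scorer. This is not Mem0's implementation: the function names are hypothetical, the score is simply similarity multiplied by an exponential recency decay, and the 0.995-per-hour decay is an assumed default standing in for whatever rerank the read path actually uses.

```python
from typing import Dict, List


def recency_weight(age_hours: float, decay_per_hour: float = 0.995) -> float:
    """Exponential decay in the age of the entry (assumed form)."""
    return decay_per_hour ** age_hours


def resolve_at_read(versions: List[Dict], now_hours: float) -> Dict:
    """Pick the winning version among ADD-only writes of the same fact.
    Each version needs 'object', 'similarity', and 'written_at_hours'."""
    return max(
        versions,
        key=lambda v: v["similarity"]
        * recency_weight(now_hours - v["written_at_hours"]),
    )


versions = [
    {"object": "SF", "similarity": 0.90, "written_at_hours": 0.0},
    {"object": "NYC", "similarity": 0.88, "written_at_hours": 720.0},
]
winner = resolve_at_read(versions, now_hours=744.0)
print(winner["object"])  # NYC
```

The recency term only settles the contest when the similarities are close; an old entry that matches the query far better can still win, which is one reason the asynchronous consolidation job remains part of the design.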

The journal-and-checkpoint pattern

The shape that holds up best in production is the journal-and-checkpoint hybrid:

The journal tier is an append-only raw episodic log. Every meaningful turn after triage (stage 1 only — no extraction, no dedup, no embedding-at-write-time other than maybe a cheap hash). The journal is the audit trail — “what did the agent observe?” — and the durable source of truth that everything else can be rebuilt from. Backed by a database table with indexes on (user, session, ts) for replay; not necessarily embedded.

The checkpoint tier is the extracted-fact store. Every fact that came out of the extract stage (stage 2), deduped against existing checkpoints (stage 3), and indexed for retrieval (stage 4). This is the tier that gets pre-paged into the hierarchical memory hot tier, the tier the read path queries on the fast path, the tier the agent treats as “what do we know about this user.”

The relationship is one-way: the journal feeds the checkpoint, never the reverse. The checkpoint can be rebuilt from the journal (by re-running the extract pipeline); the journal cannot be reconstructed from the checkpoint (because the extract step is lossy by design). The journal grows linearly with usage; the checkpoint grows sub-linearly because most turns extract zero facts and most extracted facts dedupe against existing ones. After six months of heavy usage you might have 50k journal entries and 800 checkpoints — and the read path queries the 800, not the 50k.

The journal-and-checkpoint pattern is what makes the MemoryBank Ebbinghaus-curve mechanism operationally tractable: MemoryBank’s R = e^(-t/S) formulation increases memory strength S on every recall (resetting t), so frequently-used checkpoints stay fresh while unused ones decay. The decay is on the checkpoint tier — the journal still has the original episodes for any audit query — and the decayed-out checkpoints can be regenerated from the journal if needed. This is exactly the WAL-plus-checkpoint pattern from databases, with the recall-strength term as the equivalent of a page’s reference-count for cache-replacement decisions.
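The retention formula is easy to carry as per-checkpoint state. A minimal sketch of R = e^(-t/S) follows; the class name `DecayingCheckpoint`, the choice of days as the time unit, and the +1 strength boost per recall are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class DecayingCheckpoint:
    strength: float = 1.0      # S: grows with every recall
    last_recall: float = 0.0   # timestamp of the last recall, in days (assumed unit)

    def retention(self, now: float) -> float:
        """R = exp(-t / S), where t is the time since the last recall."""
        t = max(0.0, now - self.last_recall)
        return math.exp(-t / self.strength)

    def recall(self, now: float, boost: float = 1.0) -> None:
        """A recall strengthens the memory and resets t to zero."""
        self.strength += boost
        self.last_recall = now


cp = DecayingCheckpoint()
before = cp.retention(now=1.0)   # one day unused at S=1
cp.recall(now=1.0)               # S becomes 2, t resets
after = cp.retention(now=3.0)    # two days unused at S=2
print(round(before, 4), round(after, 4))
```

Both values come out equal because doubling S doubles the time it takes to decay to the same retention; a checkpoint that falls below some retention floor can be evicted and later regenerated from the journal.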

Code: Python — an admission-controlled write gate

The smallest interesting build: a four-stage write pipeline with triage, distill, dedupe, and persist, against Chroma for the checkpoint tier and a sqlite-backed journal. Uses the Anthropic SDK for the extract stage. Install: pip install anthropic chromadb.

python

import hashlib
import json
import sqlite3
import time
import uuid
from anthropic import Anthropic

client = Anthropic()
EXTRACT_MODEL = "claude-haiku-4-5"  # small model for the high-frequency write path

# --------- Stage 1: Triage (no model call) ---------
SKIP_PATTERNS = {"ok", "thanks", "got it", "k", "ty", "thx", "yes", "no"}

def triage(role: str, content: str) -> bool:
    """Cheapest possible filter. No LLM. Returns True if the turn is a candidate."""
    if role == "system":
        return False
    # Drop trailing punctuation so "Thanks!" and "Got it." match the skip list.
    text = content.strip().lower().rstrip(".!?")
    if len(text) < 5:
        return False
    if text in SKIP_PATTERNS:
        return False
    # Skip pure clarification-request turns.
    if text.startswith(("can you", "could you")) and len(text) < 40:
        return False
    return True

# --------- Stage 2: Extract / distill (one small-model call) ---------
# Literal braces in the schema are doubled ({{ }}) so that str.format() only
# substitutes {role} and {content}; single braces would raise at format time.
EXTRACT_PROMPT = """Extract durable facts from this turn. Return JSON: a list of
fact objects, or [] if nothing is worth remembering across sessions.

Schema per fact:
  {{"type": "preference"|"fact"|"event"|"correction",
   "subject": str, "predicate": str, "object": str,
   "confidence": float in [0,1],
   "rationale": str}}

Confidence anchor:
  0.2 = ambiguous inference from one turn
  0.5 = inference from a clear statement
  0.9 = user explicitly stated it
  1.0 = system-of-record import

DO NOT EXTRACT: small talk, the assistant's own opinions, session-specific
context, hypotheticals, or counterfactuals. Return [] when in doubt.

Turn:
  role: {role}
  content: {content}"""

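def extract(role: str, content: str) -> list[dict]:
    """One small-model call. Returns only well-formed fact dicts."""
    resp = client.messages.create(
        model=EXTRACT_MODEL,
        max_tokens=500,
        messages=[{"role": "user",
                   "content": EXTRACT_PROMPT.format(role=role, content=content)}],
    )
    text = "".join(b.text for b in resp.content if b.type == "text")
    try:
        facts = json.loads(text)
    except json.JSONDecodeError:
        return []  # malformed -> drop; never silently store junk
    if not isinstance(facts, list):
        return []
    # Keep only dicts carrying every field the later stages index into, so a
    # missing key or a non-numeric confidence cannot crash dedupe or persist.
    return [
        f for f in facts
        if isinstance(f, dict)
        and all(isinstance(f.get(k), str)
                for k in ("type", "subject", "predicate", "object"))
        and isinstance(f.get("confidence"), (int, float))
    ]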

# --------- Stage 3: Dedupe / resolve (one cheap embedding lookup) ---------
def fact_key(fact: dict) -> str:
    """Stable identifier for exact-match dedup. Two facts with the same
    (subject, predicate) collide; conflict resolution happens on update."""
    key = f"{fact['subject'].lower()}::{fact['predicate'].lower()}"
    return hashlib.sha1(key.encode()).hexdigest()[:16]

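def is_duplicate(checkpoints, fact: dict, threshold: float = 0.92) -> str | None:
    """Returns the existing fact ID if this fact is a duplicate.
    NOTE: neither lookup below is scoped by user; a multi-tenant store should
    add a metadata filter on "user" to both."""
    # Exact match: same (subject, predicate) key and the same object.
    exact = checkpoints.get(where={"key": fact_key(fact)})
    for doc_id, meta in zip(exact["ids"], exact["metadatas"]):
        if str(meta.get("object", "")).lower() == fact["object"].lower():
            return doc_id
    # Semantic match: embedding similarity above the threshold.
    fact_text = f"{fact['subject']} {fact['predicate']} {fact['object']}"
    hits = checkpoints.query(query_texts=[fact_text], n_results=3)
    if not hits["ids"][0]:
        return None
    for doc_id, distance in zip(hits["ids"][0], hits["distances"][0]):
        # Chroma L2 distance -> rough similarity proxy in (0, 1]
        similarity = 1.0 / (1.0 + distance)
        if similarity >= threshold:
            return doc_id
    return None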

# --------- Stage 4: Persist ---------
def persist(checkpoints, journal_conn, user: str, fact: dict, journal_id: str):
    fact_text = f"{fact['subject']} {fact['predicate']} {fact['object']}"
    checkpoints.add(
        documents=[fact_text],
        metadatas=[{
            "user": user, "type": fact["type"],
            "subject": fact["subject"], "predicate": fact["predicate"],
            "object": fact["object"], "confidence": fact["confidence"],
            "key": fact_key(fact), "journal_id": journal_id,
            "ts": time.time(), "strength": 1.0,  # MemoryBank-style decay seed
        }],
        ids=[str(uuid.uuid4())],
    )

# --------- The full write pipeline ---------
def init_journal(path: str = "journal.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS journal (
        id TEXT PRIMARY KEY, user TEXT, session TEXT, role TEXT,
        content TEXT, ts REAL,
        extracted INTEGER DEFAULT 0  -- 0=pending, 1=processed
    )""")
    conn.execute("CREATE INDEX IF NOT EXISTS j_user_ts ON journal(user, ts)")
    return conn

def write_turn(checkpoints, journal_conn, user: str, session: str,
               role: str, content: str, *, hot_path: bool = True):
    """The journal write is unconditional (audit trail). The checkpoint write
    runs the gate, and can be deferred to a background pass if hot_path=False."""
    journal_id = str(uuid.uuid4())
    journal_conn.execute(
        "INSERT INTO journal VALUES (?, ?, ?, ?, ?, ?, 0)",
        (journal_id, user, session, role, content, time.time()),
    )
    journal_conn.commit()

    if not hot_path:
        return  # let the background pass do stages 1-4

    # Stage 1: triage. No model call; sub-millisecond.
    if not triage(role, content):
        journal_conn.execute(
            "UPDATE journal SET extracted=1 WHERE id=?", (journal_id,))
        journal_conn.commit()
        return

    # Stage 2: extract. One small-model call.
    facts = extract(role, content)
    if not facts:
        journal_conn.execute(
            "UPDATE journal SET extracted=1 WHERE id=?", (journal_id,))
        journal_conn.commit()
        return

    # Stages 3 & 4: dedupe and persist, per fact.
    for fact in facts:
        if fact["confidence"] < 0.5:
            continue  # drop low-confidence at the gate
        if is_duplicate(checkpoints, fact):
            continue  # already known; let the existing entry stand
        persist(checkpoints, journal_conn, user, fact, journal_id)

    journal_conn.execute(
        "UPDATE journal SET extracted=1 WHERE id=?", (journal_id,))
    journal_conn.commit()

# --------- Background consolidation pass ---------
def process_pending_journal(checkpoints, journal_conn, batch: int = 50):
    """Run as a separate process or cron. Drains the pending journal queue."""
    cur = journal_conn.execute(
        "SELECT id, user, session, role, content FROM journal "
        "WHERE extracted=0 ORDER BY ts LIMIT ?", (batch,))
    for row in cur.fetchall():
        journal_id, user, session, role, content = row
        if triage(role, content):
            for fact in extract(role, content):
                if fact["confidence"] >= 0.5 and not is_duplicate(checkpoints, fact):
                    persist(checkpoints, journal_conn, user, fact, journal_id)
        journal_conn.execute(
            "UPDATE journal SET extracted=1 WHERE id=?", (journal_id,))
    journal_conn.commit()

Three things to notice. First, the journal write is unconditional — even turns that fail triage land in the journal; the journal is the audit trail, and we want a future run of the extractor (possibly with a better prompt, possibly with a larger model) to be able to revisit them. Second, the four stages are explicit and individually skippable: production tuning happens at the stage boundary, not inside the stages. Want cheaper writes? Tighten triage. Want higher precision? Raise the confidence threshold in stage 3. Want background-only? Set hot_path=False everywhere and run process_pending_journal from cron. Third, the dedup pass uses both exact-match and similarity — fact_key is the stable identifier for known-duplicate detection, the similarity check catches semantic duplicates. A real production system would also pass the new fact through an LLM “is this an update to an existing fact?” check when similarity is in the 0.6-0.9 ambiguous range; for clarity that stage is omitted here.

Code: TypeScript — admission control with Mem0

The TypeScript version delegates the extract-dedupe-persist pipeline to Mem0, which ships the production-grade version of stages 2-4, and adds an explicit triage stage on top so we can see how custom admission control composes with a framework. Install: npm install mem0ai @anthropic-ai/sdk. Mem0 requires OPENAI_API_KEY for its default embeddings.

typescript

import { Memory } from "mem0ai/oss";
import Anthropic from "@anthropic-ai/sdk";

const memory = new Memory();           // Mem0 handles stages 2-4 internally
const anthropic = new Anthropic();     // for the agent itself

// --------- Stage 1: explicit triage on top of Mem0 ---------
const SKIP_PATTERNS = new Set([
  "ok", "thanks", "got it", "k", "ty", "thx", "yes", "no",
]);

const triage = (role: string, content: string): boolean => {
  if (role === "system") return false;
  const text = content.trim().toLowerCase();
  if (text.length < 5) return false;
  if (SKIP_PATTERNS.has(text)) return false;
  if ((text.startsWith("can you") || text.startsWith("could you")) && text.length < 40) {
    return false;
  }
  return true;
};

// --------- Hot-path admission: triage in foreground, persist via Mem0 ---------
type Turn = { role: "user" | "assistant"; content: string };

const writeTurnHotPath = async (
  userId: string, sessionId: string, role: "user" | "assistant", content: string,
) => {
  // Triage first — never pay for Mem0's extract pass on noise.
  if (!triage(role, content)) return;

  // Mem0 runs stages 2-4 itself: ADD-only extraction plus entity linking,
  // dedup against the existing store, and persist. We pay one LLM call per write.
  await memory.add(
    [{ role, content }],
    { userId, metadata: { session: sessionId, ts: Date.now() } },
  );
};

// --------- Session-end deferred write: buffer turns, extract in one pass ---------
const sessionBuffers = new Map<string, Turn[]>();

const bufferTurn = (sessionId: string, role: "user" | "assistant", content: string) => {
  const buf = sessionBuffers.get(sessionId) ?? [];
  buf.push({ role, content });
  sessionBuffers.set(sessionId, buf);
};

const flushSession = async (userId: string, sessionId: string) => {
  const buf = sessionBuffers.get(sessionId);
  if (!buf || buf.length === 0) return;

  // Filter at the buffer level — Mem0 sees only triage-passing turns.
  const candidates = buf.filter((t) => triage(t.role, t.content));
  if (candidates.length === 0) {
    sessionBuffers.delete(sessionId);
    return;
  }

  // One batched extract call over the entire session. Mem0's single-pass
  // ADD-only extractor handles dedupe internally; this amortizes the LLM cost
  // across all turns in the session.
  await memory.add(
    candidates,
    { userId, metadata: { session: sessionId, flushed_at: Date.now() } },
  );
  sessionBuffers.delete(sessionId);
};

// --------- Read path: standard Mem0 search, no policy logic here ---------
const recall = async (userId: string, query: string, limit = 5) => {
  const { results } = await memory.search(query, { userId, limit });
  return results;
};

// --------- The turn loop ---------
const turn = async (userId: string, sessionId: string, userMsg: string) => {
  // Hot-path option: write the user turn synchronously.
  await writeTurnHotPath(userId, sessionId, "user", userMsg);

  const memories = await recall(userId, userMsg);
  const memoryBlock = memories
    .map((m) => `- [${m.metadata?.session ?? ""}] ${m.memory}`)
    .join("\n") || "(none)";

  const resp = await anthropic.messages.create({
    model: "claude-opus-4-7",
    max_tokens: 1024,
    system: `You are a personal travel assistant. Known facts:\n${memoryBlock}`,
    messages: [{ role: "user", content: userMsg }],
  });
  const reply = resp.content
    .filter((b): b is Anthropic.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");

  // Deferred option (alternative): buffer the assistant turn for batched flush.
  bufferTurn(sessionId, "assistant", reply);
  // Caller invokes flushSession(userId, sessionId) on session close.
  return reply;
};

// Demo
await turn("u-42", "s-001", "I'm vegetarian, allergic to peanuts, traveling with a toddler.");
await turn("u-42", "s-001", "ok thanks");        // user turn skipped by triage; no Mem0 extract call
await turn("u-42", "s-001", "What about lunch in Lisbon?");
await flushSession("u-42", "s-001");              // session-end batch flush

The pattern Mem0 enforces structurally — userId as a required parameter, the single-pass ADD-only extractor, the framework-internal dedup — is exactly the journal/checkpoint hybrid above, with Mem0 owning the checkpoint side. The triage filter on top is what we add as the local policy: Mem0 will gladly extract on every turn you give it, so the cost-control layer belongs upstream of memory.add. The two write modes (writeTurnHotPath and the buffered flushSession) show the hot-path-vs-deferred trade-off as a code-level decision; in production the same harness usually uses one or the other based on the workload’s latency budget.

Comparison: how the production frameworks each draw the line

Five frameworks worth knowing by their write-policy stance.

Mem0: single-pass ADD-only extraction; conflict resolution deferred to retrieval or background; entity linking as a parallel write to a _entities collection. As of April 2026, the extract path uses gpt-5-mini by default and a hybrid retrieval (semantic + keyword + entity) at read time. Best when you want a turn-key checkpoint-only memory layer and are willing to give up the journal tier — Mem0 doesn’t preserve raw turns by default.

LangMem: explicit dual-mode design with both “hot path” (synchronous extraction inline with the agent) and “background” (asynchronous extraction via a separate manager) modes, switchable per memory type. The background manager periodically consolidates the store — dedup, summarize, prune. Best when you want explicit control over which memory types extract synchronously vs asynchronously.

Letta: journal-and-checkpoint operationalized. The recall tier is the journal (raw conversation), the archival tier is the checkpoint (extracted facts and summaries), and sleep-time agents run a background pass that consolidates recall into archival on a schedule. Best when the hierarchy is the defining feature of the system and you want first-class operations for tier promotion / demotion.

A-MEM: NeurIPS 2025 paper that takes the dedupe stage further — every new memory not only writes itself but updates the contextual attributes of related existing memories at write time, building a dynamic Zettelkasten-style link graph. The write cost is higher (more LLM calls per write) but the resulting structure means read-time queries traverse pre-built links rather than recomputing similarity from scratch. Best when read-time latency matters more than write-time cost and the workload has long-horizon retrieval patterns.

MemoryBank (AAAI 2024): introduced the Ebbinghaus-curve-inspired decay-and-reinforce write policy. Each memory carries a strength S; retention follows R = e^(-t/S). Every recall increases S by 1 and resets t to 0 — frequently-used memories stay; unused ones decay out. The decay is a read-influenced write (the recall pass updates the strength field), which makes the policy adaptive without an explicit classifier. Best as a conceptual model rather than a turnkey framework; many production systems implement the strength-and-decay idea without citing MemoryBank.
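The decay-and-reinforce rule described above can be sketched in a few lines. This is an illustration of the idea, not MemoryBank's implementation: the `MemoryItem` class, the abstract time units, and the 0.05 pruning threshold are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    strength: float = 1.0     # S in R = e^(-t/S)
    last_recall: float = 0.0  # time of the last recall; t is measured from here


def retention(item: MemoryItem, now: float) -> float:
    """R = e^(-t/S), with t the time elapsed since the last recall."""
    t = now - item.last_recall
    return math.exp(-t / item.strength)


def recall(item: MemoryItem, now: float) -> None:
    """A read that also writes: reinforce the memory and reset its clock."""
    item.strength += 1
    item.last_recall = now


def prune(items: list[MemoryItem], now: float, threshold: float = 0.05) -> list[MemoryItem]:
    """Keep only memories whose retention is still at or above the threshold."""
    return [m for m in items if retention(m, now) >= threshold]


used = MemoryItem("allergic to peanuts")
unused = MemoryItem("asked about the weather once")
recall(used, now=1.0)
recall(used, now=2.0)                      # strength is now 3
kept = prune([used, unused], now=10.0)
print([m.text for m in kept])
```

In this run the twice-recalled memory survives the pruning pass and the never-recalled one decays out, which is the adaptive behavior the policy is after.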

The 2026 production answer is converging on Mem0 or Letta for the bulk substrate, LangMem for explicit dual-mode control, the A-MEM dynamic-linking idea as a write-time enhancement, and the MemoryBank strength-and-decay model as a maintenance pattern. The production memory frameworks article works the full Letta/mem0/Zep/Graphiti comparison matrix; today’s takeaway is that the write policy is the layer where each framework’s design philosophy is most visible.

Trade-offs, failure modes, and gotchas

The write-amplification trap. A naive “every turn writes one row plus one embedding plus one extract call plus one dedup query plus one entity link” pipeline costs ~$0.01 per turn at small-model rates. At 1000 turns per active user per day that’s $10/day/user — well above the per-user revenue of most consumer products. The fix is aggressive triage: a 90% triage drop rate cuts the bill by 10x at zero quality cost, because the dropped turns weren’t worth extracting in the first place.
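The arithmetic can be made explicit with a tiny cost model. It is a sketch under one simplifying assumption: the whole per-turn cost sits behind the triage gate, so a dropped turn costs nothing. The function name `daily_write_cost` is illustrative.

```python
def daily_write_cost(turns_per_day: int, cost_per_turn: float, triage_drop_rate: float) -> float:
    """Daily write-path cost per user when triage drops a fraction of turns."""
    if not 0.0 <= triage_drop_rate <= 1.0:
        raise ValueError("triage_drop_rate must be between 0 and 1")
    admitted_turns = turns_per_day * (1.0 - triage_drop_rate)
    return admitted_turns * cost_per_turn


no_triage = daily_write_cost(1000, 0.01, 0.0)
with_triage = daily_write_cost(1000, 0.01, 0.9)
print(f"no triage:   ${no_triage:.2f}/day/user")
print(f"90% dropped: ${with_triage:.2f}/day/user")
```

In a pipeline where the journal write is unconditional, the raw row write stays outside the gate; the model assumes that cost is negligible next to embedding and extraction.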

The skip-triage cost bug. Skipping triage and sending every turn directly to the extract LLM is the single most common cost bug in hand-rolled memory systems. The triage stage costs nothing (no model call); the extract stage costs ~$0.001 per call at small-model rates. Skipping triage multiplies the extract bill by 1 / (1 − drop rate): it roughly doubles the per-turn cost when triage would have dropped half the turns, and costs 10x at a 90% drop rate. There is zero quality benefit, because the extract model returns [] for the skipped turns anyway. Always triage first; the cheapest stage of any pipeline goes first.

The confidence-collapse failure. Without an anchored confidence scale in the extract prompt, the extractor returns 0.7-0.8 for almost every fact. The confidence threshold downstream becomes a no-op, and either every fact passes (high noise) or you lower the threshold until garbage gets through. The fix is the anchor-the-scale pattern from the distill section — explicitly tell the model what each confidence level means.

The deferred-write data-loss bug. A session-end-only write policy loses every memory from sessions that don’t end cleanly: process crashes, network drops, a client that closes the tab without an explicit logout. The mitigation is to write every turn to the journal as it arrives (cheap, no extraction) and defer only the extract stage, so that in the worst case the background pass picks up where the foreground left off. The journal-and-checkpoint pattern is what makes deferred writes recoverable.
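A minimal sketch of the journal-first pattern with SQLite follows. The `journal` table, its `id`, and its `extracted` flag mirror the Python pipeline earlier in the article; the other columns and the helper names (`open_journal`, `journal_turn`, `pending_turns`, `mark_extracted`) are assumptions made for the example.

```python
import sqlite3


def open_journal(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS journal ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " session TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL,"
        " extracted INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def journal_turn(conn: sqlite3.Connection, session: str, role: str, content: str) -> int:
    """Cheap foreground write: committed before the turn returns, no extraction."""
    cur = conn.execute(
        "INSERT INTO journal (session, role, content) VALUES (?, ?, ?)",
        (session, role, content),
    )
    conn.commit()
    return cur.lastrowid


def pending_turns(conn: sqlite3.Connection) -> list[tuple]:
    """What a background pass still has to extract, including crashed sessions."""
    return conn.execute(
        "SELECT id, session, role, content FROM journal WHERE extracted=0 ORDER BY id"
    ).fetchall()


def mark_extracted(conn: sqlite3.Connection, journal_id: int) -> None:
    conn.execute("UPDATE journal SET extracted=1 WHERE id=?", (journal_id,))
    conn.commit()


conn = open_journal()
first = journal_turn(conn, "s-001", "user", "I'm vegetarian and allergic to peanuts.")
journal_turn(conn, "s-001", "user", "We're traveling with a toddler.")
mark_extracted(conn, first)           # the session "crashes" before the second turn is extracted
print(pending_turns(conn))
```

Because the raw turn is durable before the session can fail, a later run only has to read `pending_turns` to recover what the foreground never extracted.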

The conflict-resolution-on-write cost bomb. The old Mem0 pipeline ran two LLM calls per write: one to extract, one to decide ADD/UPDATE/DELETE against existing memories. The second call scales with the complexity of the existing store (more existing facts → more reasoning the model has to do), which means the per-write cost grows over time and is unpredictable. Mem0’s 2026 ADD-only redesign explicitly moves this work off the hot path. The rule of thumb: don’t reason about the existing store at write time; let read-time scoring or a background consolidation pass handle conflicts.

The schema drift bug. When the extract prompt changes (you tweak the schema, add a field, anchor the confidence scale differently), the new facts have a different shape than the old ones, and downstream dedup, retrieval, and rerank logic might assume the old shape. The fix is to version your extract schema and tag every fact with the schema version that produced it; the dedup pass can then compare apples to apples, and a one-off backfill job can re-extract old facts to the new schema. This is the same problem the text-embeddings article flagged for embedding-model upgrades; schema upgrades have the same shape.
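Version tagging can be sketched as follows. The `schema_version` field, the `EXTRACT_SCHEMA_VERSION` constant, and the helper names are hypothetical; `fact_key` and `confidence` are the fields used by the pipeline earlier in the article. Treating untagged facts as version 1 is an assumption.

```python
EXTRACT_SCHEMA_VERSION = 2   # hypothetical: bump whenever the extract prompt or schema changes


def tag_fact(fact: dict) -> dict:
    """Stamp a freshly extracted fact with the schema version that produced it."""
    return {**fact, "schema_version": EXTRACT_SCHEMA_VERSION}


def same_schema(a: dict, b: dict) -> bool:
    """Dedup should only compare facts produced by the same schema version."""
    return a.get("schema_version") == b.get("schema_version")


def needs_backfill(facts: list[dict]) -> list[dict]:
    """Facts a one-off job should re-extract; untagged facts count as version 1."""
    return [f for f in facts if f.get("schema_version", 1) < EXTRACT_SCHEMA_VERSION]


old = {"fact_key": "diet", "value": "vegetarian", "schema_version": 1}
new = tag_fact({"fact_key": "diet", "value": "vegetarian", "confidence": 0.9})
print(same_schema(old, new))
print([f["fact_key"] for f in needs_backfill([old, new])])
```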

The over-extraction failure mode. A loose extract prompt that doesn’t anchor “what not to extract” will produce 3-5 facts per turn for the average conversation. After a 50-turn session you have 200 facts; after a year you have 30k. Retrieval quality drops, dedup work explodes, and the agent’s “what do we know about this user” answer becomes a wall of redundant micro-facts. The fix is the negative-example list in the extract prompt and a strict downstream confidence threshold — prefer false negatives over false positives at the write gate, because false positives accumulate and false negatives can be backfilled from the journal later.

The under-extraction failure mode. The inverse. A too-strict prompt that demands “only extract facts the user explicitly stated, verbatim” misses 70% of the useful signal — agents commonly say “based on what you’ve told me, you’re vegetarian and traveling with a toddler” and that inferred fact is the one you want to remember. The mitigation is to extract at multiple confidence levels (0.5 for inferences, 0.9 for explicit statements) and let the read path weight by confidence — both kinds of fact have value; suppressing one entirely is a workload loss.

The journal-bloat sub-problem. A journal that retains every turn forever grows unboundedly. After a year of heavy usage you have millions of journal entries per user, most of which will never be read. The fix is journal compaction: after some retention window (90 days, six months), drop journal entries whose corresponding extracted facts are still in the checkpoint tier. The journal’s role is to support late re-extraction and audit; once a turn has been distilled, its journal entry is largely redundant. Kafka’s compaction policies, Postgres autovacuum, and SSD garbage collection all solve a version of this problem; the rules port over directly.
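A compaction pass along these lines might look like the sketch below. It assumes, purely for illustration, that checkpoint facts live in a `facts` table in the same SQLite database and carry the `journal_id` they were extracted from; in a real deployment the checkpoint tier is often a separate store and the existence check would be a lookup there.

```python
import sqlite3


def compact_journal(conn: sqlite3.Connection, cutoff_ts: float) -> int:
    """Delete journal rows older than cutoff_ts that still have a checkpointed fact."""
    cur = conn.execute(
        "DELETE FROM journal "
        "WHERE created_at < ? "
        "AND EXISTS (SELECT 1 FROM facts WHERE facts.journal_id = journal.id)",
        (cutoff_ts,),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE journal (id INTEGER PRIMARY KEY, created_at REAL, content TEXT);
CREATE TABLE facts (id INTEGER PRIMARY KEY, journal_id INTEGER, text TEXT);
INSERT INTO journal VALUES (1, 100.0, 'old, distilled');
INSERT INTO journal VALUES (2, 100.0, 'old, never distilled');
INSERT INTO journal VALUES (3, 900.0, 'recent, distilled');
INSERT INTO facts VALUES (1, 1, 'user is vegetarian');
INSERT INTO facts VALUES (2, 3, 'user travels with a toddler');
""")
deleted = compact_journal(conn, cutoff_ts=500.0)
remaining = [row[0] for row in conn.execute("SELECT id FROM journal ORDER BY id")]
print(deleted, remaining)
```

Old rows that never produced a fact are kept on purpose here, so a later extractor can still revisit them; a production policy would likely add a second, longer retention window for those.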

The privacy-leak-via-extract bug. The extract model sees the raw turn; if the turn contains PII or sensitive data the user didn’t intend to persist, a literal extraction stores it durably. The mitigation is a redaction pass between triage and extract — strip out emails, phone numbers, credit-card-shaped strings, and anything else the workload classifies as sensitive — before the turn ever reaches the extractor. The PII detection and data privacy article covers the detection cascade and reversible-tokenization patterns; the write policy is where the redaction belongs because it’s the only stage upstream of every downstream copy of the data.
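As a rough sketch of where the redaction pass sits, the function below masks a few obvious patterns before a turn would reach the extractor. The regexes are deliberately simple and illustrative; they are not the detection cascade from the PII article and will miss many real-world formats. The `redact` name and the placeholder labels are assumptions.

```python
import re

# Order matters: card-shaped digit runs are masked before the looser phone pattern.
REDACTIONS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("CARD", re.compile(r"\b(?:\d[ -]?){13,16}\b")),
    ("PHONE", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def redact(text: str) -> str:
    """Replace sensitive-looking spans with a bracketed label."""
    for label, pattern in REDACTIONS:
        text = pattern.sub(f"[{label}]", text)
    return text


raw = "Email me at jo@example.com or call +1 415-555-0100, card 4111 1111 1111 1111."
print(redact(raw))
```

In the pipeline, `redact` would run on turns that pass triage, so the extractor and every downstream copy only ever see the masked text.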

The “policy as a magic incantation” anti-pattern. Teams sometimes ship a complex write policy and treat it as set-and-forget. The defensible alternative is to instrument the policy: log triage hit rate, extract output count distribution, dedup collision rate, persist throughput. The right policy for one workload is wrong for another, and without instrumentation the tuning loop never closes. Track the inputs and outputs of every stage; the policy will need to be tuned at the six-month mark in any non-trivial workload.
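A minimal in-process version of that instrumentation could look like this. The `PolicyMetrics` class and its metric names are illustrative; a production system would more likely emit these values to its existing metrics backend.

```python
from collections import Counter


class PolicyMetrics:
    """Per-stage counters for the write path."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.facts_per_turn: Counter = Counter()   # extract output count distribution

    def triage(self, passed: bool) -> None:
        self.counts["triage_seen"] += 1
        if passed:
            self.counts["triage_passed"] += 1

    def extract(self, n_facts: int) -> None:
        self.facts_per_turn[n_facts] += 1

    def dedup(self, collided: bool) -> None:
        self.counts["dedup_seen"] += 1
        if collided:
            self.counts["dedup_collisions"] += 1

    def persist(self) -> None:
        self.counts["persisted"] += 1

    def summary(self) -> dict:
        def rate(num: str, den: str) -> float:
            return self.counts[num] / self.counts[den] if self.counts[den] else 0.0

        return {
            "triage_pass_rate": rate("triage_passed", "triage_seen"),
            "dedup_collision_rate": rate("dedup_collisions", "dedup_seen"),
            "facts_per_turn": dict(self.facts_per_turn),
            "persisted": self.counts["persisted"],
        }


metrics = PolicyMetrics()
for passed in (True, False, True, True):
    metrics.triage(passed)
for n_facts in (2, 0, 1):
    metrics.extract(n_facts)
for collided in (False, True, False):
    metrics.dedup(collided)
    if not collided:
        metrics.persist()
print(metrics.summary())
```

A triage pass rate creeping toward 1.0 or a facts-per-turn distribution shifting upward are the kinds of signals that suggest the policy needs retuning.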

What to read next

Long-Term Memory: Vector-Backed Episodic Storage — the substrate this article’s write policies feed into. The episodic store is what gets written to; the write policy decides what gets written. Together they form the full write path.
Episode Segmentation and Salience Scoring — the upstream pair of decisions every write policy depends on. Segmentation picks the unit of admission (the five signals, from fixed-window floor up to agent-emitted markers); salience scoring assigns the weight (the anchored 1-10 importance prompt and its failure modes).
Production Memory Frameworks: MemGPT/Letta, mem0, Zep, Graphiti — the full comparison matrix for the four frameworks this article references in passing. The write policy is the layer where each framework’s design philosophy is most visible; the matrix turns “Mem0 or Letta for the bulk substrate” into a workload-specific decision.
Sleep-Time Compute and Memory Consolidation — the regime the background-policy variant of this article runs in. The hot-path-vs-deferred-vs-background trade-off this piece names lands operationally in the sleep-time tier; the consolidation, dedupe, and journal-compaction passes the write policy enables all run there.