
Long-Term Memory: Vector-Backed Episodic Storage

Long-term episodic memory: vector-backed storage, episode boundaries, recency-weighted retrieval, the WAL parallel, and the unit-of-recall problem.

Jatin Bansal@blog:~/ai-engineering$ open long-term-memory

A travel-planning agent finishes a productive session on a Tuesday: the user mentioned they’re vegetarian, traveling with a toddler, allergic to peanuts, and prefer hotels with a pool. The agent files an itinerary, the conversation ends, the harness drops the in-memory state. On Thursday the same user comes back to ask about restaurants in Lisbon. The agent — running the same code, against the same model, with the same prompts — has no idea who this person is. It re-asks every preference. The user closes the tab. The product team calls it “the goldfish bug” and ships an emergency hotfix that pastes the last 50 turns of every prior session into the system prompt. The context window groans, costs triple, retrieval-from-context degrades, and the agent still feels dumb because the prior turns are 80% noise. The fix is not bigger prompts. The fix is a long-term memory tier — a deliberate, durable, queryable store of past episodes that survives the session boundary and gets retrieved on demand. This article is the deep dive on the workhorse substrate: vector-backed episodic storage.

Opening bridge

Yesterday’s piece on working memory covered the in-context tier above the conversation buffer — typed state, scratchpads, blackboards, the agent’s notebook for the current task. Working memory’s lifetime is the task; it evaporates when the harness drops the run context. The piece flagged what comes next: “when the task ends the harness can drop it or promote selected items to episodic or semantic memory” — and that promotion is the bridge into today’s tier. The memory stack overview named long-term memory as the storage side of the in-context-vs-storage line; the cognitive taxonomy split that storage side into episodic, semantic, and procedural. Today’s article picks up the largest of the three — episodic memory — because it is the substrate every other long-term tier eventually feeds from, and because the design questions it raises (what counts as an episode, when to write, how to retrieve, what to do when the store gets noisy) recur in every later memory article in the subtree.

Definition

Long-term episodic memory is a durable, append-mostly store of past agent experiences, keyed for similarity retrieval, where each entry is intended to be replayable in enough context to inform a future call. Four properties separate it from the tiers above it. First, it is durable — survives session end, process restart, and (assuming a real backend) host failure. Second, it is episode-shaped — entries are scoped to discrete units of past experience (a user turn, a tool result, a completed interaction), each with enough metadata to be intelligible on its own. Third, it is append-mostly — corrections happen by appending a new episode that supersedes the old one, not by rewriting history in place. Fourth, it is retrieved on demand — entries are not in the prompt until the harness explicitly fetches and injects them, paying both a retrieval round-trip and a context-budget cost when it does.

What episodic long-term memory is not: it is not the conversation buffer (that’s in-context, per-session, evicted under token pressure). It is not the working-memory scratchpad (that’s task-scoped, structured, mutable in place). It is not semantic memory (that’s distilled facts about the world, not raw past events) — though semantic facts are typically extracted from episodic entries by a reflection pass. It is not procedural memory either (that’s cached how-to recipes, not the record of doing them). All four tiers can sit physically in the same vector store with different metadata tags; what makes one of them “episodic long-term memory” is the read/write contract above, not the database row.

Intuition

The mental model that pays off is a write-ahead log with semantic indexing bolted on. The log is the source of truth — every meaningful turn the agent observed gets appended, in order, with a timestamp, an actor, and the content. The index is the semantic-search layer that lets the agent ask “what past entries are relevant to my current query?” without having to scan the log linearly. The two are coupled: a log without an index is unsearchable noise after a thousand entries; an index without a log is a smooth-talking liar that can’t show its work when you ask “why do you think the user is vegetarian?”

Concretely, the substrate is a vector store with structured metadata. Each row is {id, embedding, text, session_id, actor, timestamp, importance, [optional: source-episode-id]}. The embedding makes the similarity search work; the metadata makes filtered queries possible (“only this user’s episodes,” “only the last 30 days,” “only assistant turns above importance 0.7”). The text is the raw episode content; the timestamp drives recency-weighted ranking; the optional source pointer makes provenance traceable when a downstream semantic fact disagrees with the world.
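A minimal sketch of that row as a Python dataclass, with a filter helper that mirrors the example queries. The field names follow the row shape above; `EpisodeRow`, `matches`, and the toy two-dimensional embeddings are illustrative assumptions, not any particular vector store's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EpisodeRow:
    id: str
    embedding: tuple[float, ...]   # produced by whatever embedding model you pin
    text: str                      # raw episode content
    session_id: str
    actor: str                     # e.g. "user", "assistant", "tool"
    timestamp: float               # Unix seconds; drives recency ranking
    importance: float              # normalized to [0, 1]
    source_episode_id: Optional[str] = None  # provenance pointer, if any

def matches(row: EpisodeRow, *, actor: Optional[str] = None,
            since: Optional[float] = None, min_importance: float = 0.0) -> bool:
    """Metadata pre-filter applied before (or alongside) similarity search."""
    if actor is not None and row.actor != actor:
        return False
    if since is not None and row.timestamp < since:
        return False
    return row.importance >= min_importance

rows = [
    EpisodeRow("e1", (0.1, 0.9), "I'm vegetarian.", "s-001", "user", 1_000.0, 0.9),
    EpisodeRow("e2", (0.8, 0.2), "Noted!", "s-001", "assistant", 1_001.0, 0.2),
]
# "only assistant turns above importance 0.7"
important_assistant = [r.id for r in rows if matches(r, actor="assistant", min_importance=0.7)]
# "only user turns since t=500"
recent_user = [r.id for r in rows if matches(r, actor="user", since=500.0)]
```

In a real store the filter runs inside the database as a metadata predicate; the helper here only makes the contract explicit.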

Two design questions force themselves on every implementation. The first is what counts as one episode — the unit of recall, sometimes called episode granularity. A single user message? A user/assistant pair? An entire session summarized into one row? Each choice has different retrieval characteristics, and getting it wrong is the most common reason a “we added memory” project doesn’t move the user-satisfaction needle. The second is what to embed — the raw text, a paraphrase, a hypothetical question the episode answers (the HyDE trick applied to writes), a structured summary. Both decisions are upstream of every retrieval-quality knob you might later tune.

The distributed-systems parallel

Three honest parallels worth naming.

Episodic memory is a write-ahead log, almost literally. The properties match: monotonic, ordered, append-only by default, every entry intelligible in isolation, replayable from any cursor forward. The reason WALs exist in databases is the same reason episodic memory exists in agents — you need a durable record of what happened so that the system can reconstruct state after a crash and so that future computations (compaction, reflection, audits) can operate against a fixed history. Postgres’s WAL is the source of truth that the page cache derives from; the episodic store is the source of truth that the working-memory scratchpad and the semantic-facts store derive from. Rewriting episodic history in place breaks the same invariants in agents that rewriting a WAL in place breaks in databases — and for the same reason: it destroys the audit trail that downstream computations assume is stable.
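The WAL invariants can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `EpisodeLog`, `LogEntry`, and the `supersedes` pointer are hypothetical names, and a real implementation would persist entries to a durable backend.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass(frozen=True)
class LogEntry:
    seq: int                          # monotonic position in the log
    text: str
    supersedes: Optional[int] = None  # seq of the entry this one corrects

class EpisodeLog:
    """Append-only: entries are never mutated or removed once written."""

    def __init__(self) -> None:
        self._entries: list[LogEntry] = []

    def append(self, text: str, supersedes: Optional[int] = None) -> int:
        seq = len(self._entries)
        self._entries.append(LogEntry(seq, text, supersedes))
        return seq

    def replay(self, cursor: int = 0) -> Iterator[LogEntry]:
        """Yield every entry from `cursor` forward, in write order."""
        yield from self._entries[cursor:]

    def current(self) -> list[LogEntry]:
        """Entries not superseded by a later one; the history itself stays intact."""
        dead = {e.supersedes for e in self._entries if e.supersedes is not None}
        return [e for e in self._entries if e.seq not in dead]

log = EpisodeLog()
first = log.append("User prefers window seats.")
log.append("User is vegetarian.")
log.append("Correction: user prefers aisle seats.", supersedes=first)

current_texts = [e.text for e in log.current()]
full_history = [e.seq for e in log.replay()]
```

The correction is a new entry that points at the old one, so `current()` hides the stale claim while `replay()` still returns all three entries for audits and reflection.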

Episodic memory + reranking is the read path of a search engine. Once you have a vector index over episodes, the retrieval pass is a hybrid search problem: dense similarity over embeddings, optional sparse keyword filter for hard requirements (“only episodes mentioning the project name”), and a rerank pass that weighs recency, importance, and tag matches alongside the raw similarity score. The retrieval stack from the RAG subtree is the engine; the memory framing adds two signals (recency, importance) that pure-corpus RAG never has to worry about. Treating episodic recall as “cosine-similarity top-K and we’re done” is the same mistake as treating production search as “BM25 top-10 and we’re done” — it works for the first 1000 documents and silently degrades after.

Reflection — the maintenance pass that turns episodes into semantic facts — is log compaction. Kafka log compaction keeps the latest value per key and prunes older versions; agentic reflection takes a window of related episodes (“everything the user told me about their dietary preferences”) and emits a single distilled fact (“user is vegetarian, allergic to peanuts”) that supersedes the raw episodes for most future reads. The raw episodes can still be retained as the durable record; the compacted fact lives in the semantic store and is what gets pinned to the system prompt every turn. The reflection piece is the dedicated deep dive — the importance-threshold trigger, the salient-question generation step, the evidence-citation pattern that grounds insights to their source episodes, and the failure modes (self-reinforcing error, over-generalization) that turn naive reflection into a confirmation-bias engine.
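The compaction half of that analogy is easy to sketch. The example assumes each episode has already been assigned a key (in practice an LLM reflection pass does the grouping and the distillation); `compact`, `compact_with_evidence`, and the sample data are illustrative only.

```python
from collections import defaultdict

# (key, value, timestamp) triples; the key is whatever the reflection pass
# decides the episodes are "about". Hypothetical data for illustration.
episodes = [
    ("diet", "user is vegetarian", 100.0),
    ("allergy", "user is allergic to peanuts", 110.0),
    ("diet", "user is vegetarian and avoids eggs", 200.0),
]

def compact(entries):
    """Kafka-style compaction: keep only the latest value per key."""
    latest = {}
    for key, value, ts in entries:
        if key not in latest or ts >= latest[key][1]:
            latest[key] = (value, ts)
    return {key: value for key, (value, _) in latest.items()}

def compact_with_evidence(entries):
    """Same idea, but keep pointers (list indices) back to the source episodes."""
    evidence = defaultdict(list)
    for idx, (key, _, _) in enumerate(entries):
        evidence[key].append(idx)
    return {key: (value, evidence[key]) for key, value in compact(entries).items()}

facts = compact(episodes)
```

The evidence variant is the part worth keeping in a real system: a compacted fact that cannot point back to its source episodes cannot be audited when it turns out to be wrong.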

The unit-of-recall problem

Three common granularities, ordered from finest to coarsest.

Per-message episode — each user or assistant turn is its own row. Highest fidelity, highest cardinality. Best when the agent needs to retrieve specific things the user said (“when did the user mention they were vegetarian?”). Worst when relevance is contextual — a one-line “yes” message is meaningless without its question, but a per-message episode strips that pairing. Mitigation: store the previous-turn ID as metadata and re-fetch the pair when a hit lands on a short reply.
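The mitigation can be sketched as a post-retrieval step. The word-count cutoff, the `prev_id` field name, and the dictionary-backed store are assumptions made for illustration.

```python
SHORT_REPLY_WORDS = 4  # assumed cutoff; tune per workload

# Hypothetical per-message rows keyed by ID, each pointing at the previous turn.
store = {
    "m1": {"text": "Should I book the hotel with the pool?", "prev_id": None},
    "m2": {"text": "yes", "prev_id": "m1"},
}

def expand_hit(hit_id: str, rows: dict) -> str:
    """Return the hit's text, prefixed by the previous turn when the hit is short."""
    row = rows[hit_id]
    is_short = len(row["text"].split()) <= SHORT_REPLY_WORDS
    prev = rows.get(row["prev_id"]) if row["prev_id"] else None
    if is_short and prev is not None:
        return f"{prev['text']}\n{row['text']}"
    return row["text"]

expanded = expand_hit("m2", store)
```

A hit on the bare "yes" comes back with its question attached, while longer messages are returned unchanged.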

Per-exchange episode — each user turn plus the assistant’s reply is one row. Better contextual integrity, lower cardinality. The dominant default in production frameworks. Mem0 defaults to pair-shaped writes through its add(messages, user_id=...) API; the framework processes the message list and emits one or more memory entries per call.

Per-session episode — each session compresses to a single summary row with a date range. Lowest fidelity, lowest cardinality. Works when the unit of “I’d like to recall this past interaction” is the whole session (“the conversation we had last Tuesday about Lisbon”). Almost always paired with one of the higher-fidelity tiers — you want the session summary for fast similarity search and the per-message entries for the detailed lookup once a session is retrieved.

The right choice is workload-dependent. A customer-support agent recalling “did the user mention which OS they were on?” wants per-message episodes. A long-running personal assistant recalling “what were we working on last week?” wants per-session summaries. Most production systems end up running two tiers in parallel — a summary tier for navigation and a per-exchange tier for the details — which adds storage cost but pays for itself in retrieval quality. The memory benchmark literature is consistent on the point: hierarchical granularity outperforms flat-per-message in every long-multi-session benchmark published in the last 18 months. The full mechanics of deciding the unit at write time — the segmentation algorithm, the cognitive-science grounding from event segmentation theory, and the layered combination of fixed-window, semantic-shift, prediction-error, structural, and agent-emitted signals — get their own deep dive in the episode segmentation and salience scoring article.
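A toy sketch of the two-tier read: pick sessions by summary, then search the per-exchange rows inside them. Word overlap stands in for embedding similarity so the example runs without a model; the data, `overlap`, and `recall` are hypothetical.

```python
def overlap(a: str, b: str) -> float:
    """Jaccard word overlap; a stand-in for embedding similarity in this sketch."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Tier 1: one summary row per session. Tier 2: per-exchange rows keyed by session.
summaries = {
    "s-001": "planning a lisbon trip with a toddler",
    "s-002": "debugging a flaky python test suite",
}
exchanges = {
    "s-001": ["which lisbon hotels have a pool", "vegetarian restaurants near alfama"],
    "s-002": ["pytest fixture teardown order", "mocking the clock in tests"],
}

def recall(query: str, top_sessions: int = 1, top_exchanges: int = 1) -> list[str]:
    """Navigate by summary first, then search details inside the chosen sessions."""
    ranked = sorted(summaries, key=lambda s: overlap(query, summaries[s]), reverse=True)
    candidates = [ex for s in ranked[:top_sessions] for ex in exchanges[s]]
    candidates.sort(key=lambda ex: overlap(query, ex), reverse=True)
    return candidates[:top_exchanges]

result = recall("vegetarian restaurants for the lisbon trip")
```

The summary tier narrows the search to the Lisbon session before any per-exchange row is scored, which is what keeps unrelated sessions out of the candidate set.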

The write path: what’s worth remembering

The naive “store every turn” policy works for a while and then collapses. The collapse happens around the 1k-episode mark, when retrieval signal-to-noise crosses the threshold where the right episode is still in the store but no longer in the top-K. A defensible write policy classifies each turn before writing.

Three policies in increasing sophistication:

  1. Heuristic write gate. Skip system messages, skip empty assistant turns, skip pure clarifications (“Can you repeat that?”). Cheap, no model call, catches the worst noise.
  2. LLM-classified write gate. A small-model call returns {should_write: bool, importance: float, type: "preference"|"fact"|"event"}. Slower, ~50ms per turn at small-model latencies, but the precision is dramatically better. This is the policy Mem0’s fact-extraction pipeline uses, and the Mem0 paper attributes most of their recall lift to it.
  3. Deferred write at session end. Skip per-turn writes entirely; at session close, run a summarization pass over the whole session and write 2–5 distilled episodes. Cheapest at write time, lowest fidelity, ideal for short bounded interactions.
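The first policy fits in a dozen lines. A minimal sketch, assuming a small hand-written pattern list for "pure clarification" turns; the patterns and the `should_write` name are illustrative, not taken from any framework.

```python
import re

# Assumed patterns for "pure clarification" turns; extend per product.
CLARIFICATION = re.compile(
    r"^\s*(can you repeat that|what do you mean|sorry|pardon|huh)\b[\s?.!]*$",
    re.IGNORECASE,
)

def should_write(role: str, content: str) -> bool:
    """Policy 1: a cheap, model-free gate that drops the most obvious noise."""
    if role == "system":
        return False                      # skip system messages
    text = content.strip()
    if not text:
        return False                      # skip empty turns
    if CLARIFICATION.match(text):
        return False                      # skip pure clarifications
    return True
```

The pattern only matches when the clarification is the whole message, so "Sorry, I meant vegetarian" still gets written.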

The trap is to skip the write policy and tell yourself you’ll add it later. By the time you have 10k uncurated episodes, the cost of cleaning the corpus exceeds the cost of building the gate from the start. The memory write policies article is the deep dive on the classifier design — the four-stage pipeline, the journal-and-checkpoint pattern, and the hot-path-vs-deferred-vs-background trade-off; the rule of thumb for now is have a policy, even a heuristic one, on day one.

The read path: recency × importance × similarity

The retrieval pass over an episodic store is not the same as RAG retrieval over a static corpus. Pure cosine similarity gives the agent the most textually similar episode, which is often not the most useful one — the user’s preference from last month is more useful than a textually similar message from a year ago that has since been contradicted.

The canonical formulation, from Generative Agents (Park et al., 2023), is a weighted combination:

text
score(episode | query) = α·recency(episode) + β·importance(episode) + γ·similarity(query, episode)

with each term normalized to [0, 1]. Recency uses an exponential decay since the episode was last retrieved, not just written, so episodes that keep proving useful keep their freshness; importance is an LLM-assigned 1–10 score at write time, normalized; similarity is cosine over the embedding. The published baseline sets the weights equal (α = β = γ = 1); tuning them is a workload-specific calibration step. The Generative Agents paper rated importance with a prompt that explicitly anchored the scale: 1 is “purely mundane (e.g., brushing teeth, making bed)” and 10 is “extremely poignant (e.g., a break up, college acceptance).” That anchoring matters; LLM ratings without anchored scales drift toward all-5s. The episode segmentation piece works the anchored 1-10 salience prompt in detail (and the per-segment-vs-per-turn unit question), so this section can stay focused on the read-time rerank.

In production this means every read happens in two stages: a coarse vector recall pulls the top-K candidates by similarity (K ~ 3-5× the final budget), then the rescore stage applies the recency and importance terms and returns the top-N. This is the same two-stage retrieve-then-rerank pattern that wins on RAG, with different rerank signals. The memory retrieval policies article works the formula in detail — per-candidate normalization, read-driven last_read and retrieval-count updates, the LRU/LFU/ARC cache-replacement parallel, and the workload-specific weight tuning — and is the read-path companion to today’s substrate. Today’s frame is: don’t ship cosine-only retrieval against an episodic store and expect it to feel like memory.
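The second stage can be sketched without a vector store. The example assumes stage one has already returned candidates with a similarity score, and it uses min-max scaling across the candidate set as one reasonable way to put the three signals on a common [0, 1] range; `rescore`, `min_max`, and the sample hits are hypothetical.

```python
import math

def min_max(values: list[float]) -> list[float]:
    """Scale to [0, 1] across the candidate set; constant inputs map to 0.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def rescore(candidates: list[dict], now: float, top_n: int,
            alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0,
            tau_days: float = 7.0) -> list[dict]:
    """Stage 2: combine recency, importance, and similarity over stage-1 hits."""
    if not candidates:
        return []
    recency = min_max([math.exp(-(now - c["last_read"]) / 86_400 / tau_days)
                       for c in candidates])
    importance = min_max([c["importance"] for c in candidates])
    similarity = min_max([c["similarity"] for c in candidates])
    scored = [
        (alpha * r + beta * i + gamma * s, c)
        for r, i, s, c in zip(recency, importance, similarity, candidates)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

DAY = 86_400.0
now = 400 * DAY
# Stage-1 output (hypothetical): the stale episode is the closest textual match.
hits = [
    {"id": "old", "similarity": 0.92, "importance": 0.5, "last_read": now - 365 * DAY},
    {"id": "new", "similarity": 0.85, "importance": 0.8, "last_read": now - 2 * DAY},
    {"id": "noise", "similarity": 0.60, "importance": 0.1, "last_read": now - 1 * DAY},
]
top = rescore(hits, now=now, top_n=2)
```

Similarity alone would rank the year-old episode first; with equal weights the recent, important one moves ahead of it and the low-similarity filler drops out.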

Code: Python — episodic store with recency-weighted retrieval

The smallest interesting build: an LLM-gated write path, a recency-weighted read path, and a turn loop that demonstrates cross-session recall. Uses the Anthropic SDK for the model and Chroma as the local vector store. Install: pip install anthropic chromadb.

python
import json
import math
import time
import uuid
from anthropic import Anthropic
import chromadb

client = Anthropic()
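# In-memory client keeps the demo self-contained; for storage that survives a
# restart, use chromadb.PersistentClient(path=...) instead.
chroma = chromadb.Client()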
episodes = chroma.get_or_create_collection("episodes")

MODEL = "claude-opus-4-7"

# --------- write path: classify, then maybe store ---------
# Literal braces in the schema line are doubled so str.format() leaves them alone.
WRITE_GATE_PROMPT = """Decide whether this turn is worth durably remembering for
future sessions. Return JSON only.

Schema: {{"should_write": bool, "importance": int 1-10, "type": "preference"|"fact"|"event"|"none"}}

Importance scale: 1 = mundane filler, 5 = useful context, 10 = pivotal fact about
the user or task that would change future answers.

Turn:
- role: {role}
- content: {content}"""

def classify_for_write(role: str, content: str) -> dict | None:
    resp = client.messages.create(
        model="claude-haiku-4-5",  # small-model gate keeps the write path cheap
        max_tokens=200,
        messages=[{"role": "user",
                   "content": WRITE_GATE_PROMPT.format(role=role, content=content)}],
    )
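    try:
        verdict = json.loads("".join(b.text for b in resp.content if b.type == "text"))
        if not verdict.get("should_write"):
            return None
        # Validate the fields the write path reads, so bad output fails here.
        verdict["importance"] = float(verdict["importance"])
        verdict["type"] = str(verdict["type"])
        return verdict
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, AttributeError):
        return None  # malformed -> skip; never silently store junk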

def write_episode(session: str, user: str, role: str, content: str):
    """Append-only write. Classification is what makes this episodic, not log-spam."""
    verdict = classify_for_write(role, content)
    if not verdict:
        return
    episodes.add(
        documents=[content],
        metadatas=[{
            "session": session, "user": user, "role": role,
            "ts": time.time(), "last_read": time.time(),
            "importance": verdict["importance"] / 10.0,
            "type": verdict["type"],
        }],
        ids=[str(uuid.uuid4())],
    )

# --------- read path: recency × importance × similarity ---------
def read_episodes(user: str, query: str, top_n: int = 5, recall_k: int = 20) -> list[dict]:
    """Two-stage: coarse vector recall, then rescore with recency and importance."""
    hits = episodes.query(
        query_texts=[query],
        n_results=recall_k,
        where={"user": user},  # tenant isolation; never skip this in multi-user systems
    )
    if not hits["ids"][0]:
        return []
    now = time.time()
    scored: list[tuple[float, str, dict, str]] = []
    for doc_id, doc, meta, distance in zip(
        hits["ids"][0], hits["documents"][0], hits["metadatas"][0], hits["distances"][0]
    ):
        # Recency: exponential decay since last read, time constant 7 days
        # (half-life ≈ 4.9 days)
        age_days = (now - meta["last_read"]) / 86_400
        recency = math.exp(-age_days / 7)
        importance = meta["importance"]
        # Chroma's default space is squared L2 distance; map it to a (0, 1] similarity proxy
        similarity = 1.0 / (1.0 + distance)
        # Equal weights baseline; calibrate per workload.
        score = recency + importance + similarity
        scored.append((score, doc_id, meta, doc))
    scored.sort(reverse=True)
    # Update last_read for retrieved episodes — episodes that keep proving useful
    # stay fresh; episodes nobody reads decay out of the top-K naturally.
    for _, doc_id, meta, _ in scored[:top_n]:
        meta["last_read"] = now
        episodes.update(ids=[doc_id], metadatas=[meta])
    return [{"text": doc, "meta": meta} for _, _, meta, doc in scored[:top_n]]

# --------- turn loop ---------
SYSTEM_TEMPLATE = """You are a personal travel assistant with episodic memory.

## Relevant past episodes (retrieved on demand)
{episodes}

Use the episodes to answer in a way consistent with what the user has told you
before. Do not re-ask things the episodes already answer."""

def turn(session: str, user: str, user_msg: str) -> str:
    retrieved = read_episodes(user, user_msg, top_n=5)
    episodes_block = "\n".join(
        f"- [{e['meta']['type']}, importance={e['meta']['importance']:.1f}] {e['text']}"
        for e in retrieved
    ) or "(none)"
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=SYSTEM_TEMPLATE.format(episodes=episodes_block),
        messages=[{"role": "user", "content": user_msg}],
    )
    reply = "".join(b.text for b in resp.content if b.type == "text")
    write_episode(session, user, "user", user_msg)
    write_episode(session, user, "assistant", reply)
    return reply

# Tuesday session
print(turn("s-001", "u-42", "I'm vegetarian, allergic to peanuts, traveling with a toddler."))
# Thursday session — different session ID, same user
print(turn("s-002", "u-42", "What should I eat for lunch in Lisbon?"))

Three things to notice. First, the write path is gated by a small-model classifier — the gate is what separates “episodic memory” from “transcript dump.” A naive harness that writes every turn will fill the store with noise inside a week. Second, the where={"user": user} filter is mandatory — long-term memory without tenant isolation is a data-leak waiting to happen, and the next time the agent serves user B it must not retrieve user A’s episodes. Third, last_read decays, not ts — episodes that keep proving useful keep their freshness; this is the LRU/LFU hybrid the Generative Agents paper formalized, and it matters more than the absolute decay constant. The single biggest bug I see in hand-rolled memory implementations is decaying on write timestamp only; an episode the agent has retrieved 50 times should not decay at the same rate as one written and never read.

Code: TypeScript — same shape against LangGraph stores

The TypeScript version delegates the storage layer to LangGraph and uses its InMemoryStore with the namespace-and-key API. Install: npm install @langchain/langgraph @langchain/anthropic @langchain/core @langchain/openai. In production swap InMemoryStore for LangGraph’s Postgres store.

typescript
import { ChatAnthropic } from "@langchain/anthropic";
import { OpenAIEmbeddings } from "@langchain/openai";
import { InMemoryStore } from "@langchain/langgraph";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";
import { randomUUID } from "node:crypto";

const model = new ChatAnthropic({ model: "claude-opus-4-7" });
const gateModel = new ChatAnthropic({ model: "claude-haiku-4-5" });

// LangGraph store with semantic search enabled; the namespace tuple is the
// tenant scope, the key is the episode ID, the value is the episode payload.
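const store = new InMemoryStore({
  index: {
    dims: 1536,
    // LangGraph.js names this option `embeddings`; `fields` limits indexing to the episode text.
    embeddings: new OpenAIEmbeddings({ model: "text-embedding-3-small" }),
    fields: ["text"],
  },
});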

type Episode = {
  text: string;
  role: "user" | "assistant";
  session: string;
  ts: number;
  lastRead: number;
  importance: number; // 0..1
  type: "preference" | "fact" | "event";
};

// --------- write path ---------
const classify = async (role: string, content: string)
  : Promise<Pick<Episode, "importance" | "type"> | null> => {
  const resp = await gateModel.invoke([
    new HumanMessage(
      `Decide whether this turn is worth durably remembering. Return JSON only.\n` +
      `Schema: {"should_write":bool,"importance":1-10,"type":"preference"|"fact"|"event"|"none"}\n` +
      `Anchor: 1 = mundane, 10 = pivotal.\n` +
      `role: ${role}\ncontent: ${content}`,
    ),
  ]);
  try {
    const v = JSON.parse(resp.content as string);
    if (!v.should_write) return null;
    return { importance: v.importance / 10, type: v.type };
  } catch { return null; }
};

const writeEpisode = async (user: string, session: string, role: "user" | "assistant", text: string) => {
  const verdict = await classify(role, text);
  if (!verdict) return;
  const now = Date.now();
  const ep: Episode = { text, role, session, ts: now, lastRead: now, ...verdict };
  // namespace tuple = tenant scope; episode ID = key. The text field is what gets embedded.
  await store.put(["episodes", user], randomUUID(), ep);
};

// --------- read path ---------
const readEpisodes = async (user: string, query: string, topN = 5, recallK = 20)
  : Promise<Episode[]> => {
  const hits = await store.search(["episodes", user], { query, limit: recallK });
  if (hits.length === 0) return [];
  const now = Date.now();
  const scored = hits.map((h) => {
    const ep = h.value as Episode;
    const ageDays = (now - ep.lastRead) / 86_400_000;
    const recency = Math.exp(-ageDays / 7);
    // LangGraph's search returns a score in [0,1] (higher = more similar).
    const similarity = h.score ?? 0;
    return { ep, key: h.key, total: recency + ep.importance + similarity };
  });
  scored.sort((a, b) => b.total - a.total);
  const top = scored.slice(0, topN);
  // Refresh lastRead on retrieved episodes.
  await Promise.all(top.map(({ ep, key }) =>
    store.put(["episodes", user], key, { ...ep, lastRead: now }),
  ));
  return top.map((s) => s.ep);
};

// --------- turn ---------
export const turn = async (user: string, session: string, userMsg: string): Promise<string> => {
  const retrieved = await readEpisodes(user, userMsg);
  const episodesBlock = retrieved.length
    ? retrieved.map((e) => `- [${e.type}, importance=${e.importance.toFixed(1)}] ${e.text}`).join("\n")
    : "(none)";
  const resp = await model.invoke([
    new SystemMessage(
      `You are a personal travel assistant with episodic memory.\n\n` +
      `## Relevant past episodes\n${episodesBlock}\n\n` +
      `Use the episodes; do not re-ask things they already answer.`,
    ),
    new HumanMessage(userMsg),
  ]);
  const reply = resp.content as string;
  await writeEpisode(user, session, "user", userMsg);
  await writeEpisode(user, session, "assistant", reply);
  return reply;
};

// Tuesday
await turn("u-42", "s-001", "I'm vegetarian, allergic to peanuts, traveling with a toddler.");
// Thursday — fresh session, same user
await turn("u-42", "s-002", "What should I eat for lunch in Lisbon?");

The namespace shape (["episodes", user]) is what LangGraph’s docs call the tenant scope — every read and every write is scoped to that tuple, which makes multi-user isolation a structural property rather than a query-level convention you can forget. Mem0’s API exposes the same idea differently: client.add(messages, user_id=...) and client.search(query, user_id=...) make user_id a required parameter, enforced at the API level — both designs are converging on the same principle (tenant scope is not optional in long-term memory) through different mechanisms.

Production frameworks: how each draws the line

Three frameworks worth knowing by the boundary they draw between episodic and semantic stores.

Mem0 runs an LLM-gated fact-extraction pipeline at write time: add(messages, user_id=...) accepts a conversation, extracts durable facts and preferences, and writes them as memory entries with type tags. The result is closer to “semantic memory with episodic timestamps” than a pure episodic log, because Mem0 prioritizes the distilled claim over the raw episode. Their published numbers (Mem0 paper, Chhikara et al., 2025) show large recall gains over per-message storage and dramatic cost reduction at retrieval time; the 2026 numbers from their State of AI Agent Memory 2026 post extend that with the Mem0g graph extension. Best fit: workloads where the user’s preferences and durable facts matter more than the chronological log.

Letta (the MemGPT descendant) separates recall memory — raw conversation messages stored verbatim, semantically searchable — from archival memory — processed, summarized, indexed passages — and pins a small core-memory block in-context with persona, human, and task fields. Recall is the episodic log; archival is closer to semantic; core is the working-memory equivalent. The three-tier shape maps directly onto the cognitive taxonomy and is the closest off-the-shelf framework to “all four CoALA tiers as first-class concepts.”

LangGraph stores are a lower-level primitive: a JSON-document store organized by namespace and key, with optional embedding-based search. LangGraph deliberately does not enforce a memory model on top of the store — you decide whether your namespace holds episodes, facts, procedures, or a mix. The companion LangMem SDK layers an opinionated memory model (semantic, episodic, procedural namespaces with hot-path and background management) on top of the store API. Best fit: teams that want to build their own memory model and need a clean storage primitive rather than a prescriptive framework.

OpenAI Agents SDK ships a session abstraction with persistent backends (SQLAlchemySession, Redis, MongoDB, Dapr, advanced SQLite) but treats sessions as the short-term tier; long-term episodic memory is a layer you add on top, typically using one of the frameworks above or rolling a vector-store-backed get_items/add_items extension. The production memory frameworks article works the full Letta/mem0/Zep/Graphiti comparison matrix; today’s takeaway is that the three frameworks above represent three different stances — distill at write (Mem0), log everything + tiered retrieval (Letta), give me a store primitive and I’ll build the model (LangGraph) — and the right choice is downstream of which stance fits your workload.

Trade-offs, failure modes, and gotchas

The write-amplification trap. A naive “store every turn” policy works fine until the corpus crosses ~1k episodes per user, then retrieval precision collapses and the agent starts surfacing irrelevant memories. The fix is a write gate (heuristic, classifier, or session-end summarization); the Mem0 paper’s empirical result is that aggressive write-time filtering is where most of the long-term-memory quality comes from. Equivalent in chunking terms: low-information chunks are noise that competes with high-information chunks for top-K slots.

The tenant-isolation bug. Long-term memory without a hard scope filter on every read is a data leak. Some frameworks (Mem0) make user_id a required parameter; some (LangGraph) make the namespace tuple structural; some (hand-rolled) make it an optional argument you can forget. Audit your harness specifically for “did this read pass the user filter?” before shipping to production. A single missed filter on a debug endpoint is the kind of bug that ends up in a postmortem. The memory privacy and multi-tenancy article is the deep dive on the structural patterns (typed scope values, default-deny postures, verifiable GDPR deletion, the MINJA-class memory-injection attacks) that turn this from “remember to pass the argument” into a property the type system enforces.
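One way to make the scope hard to forget is to put it in the store's interface. A minimal sketch, assuming an in-memory dictionary and substring matching in place of a real vector search; `ScopedEpisodeStore` is a hypothetical wrapper, not an API from any of the frameworks named above.

```python
class ScopedEpisodeStore:
    """Wraps a store so that no read or write can happen without a tenant scope."""

    def __init__(self) -> None:
        self._rows: dict[str, list[str]] = {}   # user_id -> episode texts

    def _require_scope(self, user_id: str) -> str:
        if not isinstance(user_id, str) or not user_id.strip():
            raise ValueError("user_id is required for every memory operation")
        return user_id

    def add(self, *, user_id: str, text: str) -> None:
        self._rows.setdefault(self._require_scope(user_id), []).append(text)

    def search(self, *, user_id: str, query: str) -> list[str]:
        # Default-deny: an unknown scope returns nothing, never "all users".
        rows = self._rows.get(self._require_scope(user_id), [])
        return [t for t in rows if query.lower() in t.lower()]

memory = ScopedEpisodeStore()
memory.add(user_id="u-a", text="Allergic to peanuts")
leaked = memory.search(user_id="u-b", query="peanuts")
own = memory.search(user_id="u-a", query="peanuts")
```

Because `user_id` is keyword-only with no default, a call site that omits it fails immediately with a `TypeError` instead of quietly reading across tenants.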

The embedding-drift bug. Upgrade your embedding model and every previously-stored vector is now stale relative to query vectors. Retrieval degrades silently — no error, no alert, just gradually worse recall. The text embeddings article named the underlying mechanic; long-term memory is where it bites hardest because the store is constantly growing. Pin your embedding model version in your build, re-embed the entire store on any change, and treat embedding upgrades the same way you’d treat a schema migration. The memory conflict and forgetting article covers the migration patterns — dual-index swaps, alias-based versioning, and the Drift-Adapter affine map that defers re-encoding entirely — in detail.
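Treating the embedding model as part of the schema can be as simple as stamping every row and checking the stamp at read time. A sketch under stated assumptions: the model identifier, the metadata key, and the helper names are all hypothetical.

```python
EMBEDDING_MODEL = "example-embed-v1"   # hypothetical identifier, pinned in config

class EmbeddingVersionMismatch(RuntimeError):
    pass

def stamp(metadata: dict) -> dict:
    """Record which embedding model produced this row's vector."""
    return {**metadata, "embedding_model": EMBEDDING_MODEL}

def assert_compatible(rows: list[dict], query_model: str = EMBEDDING_MODEL) -> None:
    """Fail loudly instead of silently comparing vectors from different models."""
    stale = sorted({r["embedding_model"] for r in rows} - {query_model})
    if stale:
        raise EmbeddingVersionMismatch(
            f"store contains vectors from {stale}; re-embed before querying "
            f"with {query_model}"
        )

def needs_reembedding(rows: list[dict], target_model: str) -> list[dict]:
    """Migration worklist: every row whose vector predates the target model."""
    return [r for r in rows if r["embedding_model"] != target_model]

rows = [stamp({"id": "e1"}), stamp({"id": "e2"})]
assert_compatible(rows)                                  # same model: passes
worklist = needs_reembedding(rows, "example-embed-v2")   # upgrade: both rows
```

The check turns silent recall degradation into an explicit error, and the worklist is the input to whichever re-embedding strategy the migration uses.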

The “last write wins” silent corruption. When the agent extracts a preference and writes it as a semantic-flavored episode (“user is vegetarian”) but later writes a contradicting one (“user just ordered steak”), naive last-write-wins lets a single hallucinated extraction corrupt the user model. Two mitigations: (1) keep both writes as episodes and let the recency-weighted retrieval surface both with the more recent one ranked higher, letting the model see the conflict; (2) gate writes-to-existing-keys with a confidence threshold — the new value must clear a confidence bar to overwrite. The latter is what production semantic stores do; the former is the safer episodic default. The memory conflict and forgetting article works the resolver-driven third option (ADD/UPDATE/DELETE/NOOP with soft-delete) in detail.
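Both mitigations are small. The sketch below assumes the extractor returns a confidence value in [0, 1] and that facts are keyed by topic; the threshold and the function names are illustrative choices.

```python
OVERWRITE_THRESHOLD = 0.8   # assumed bar; calibrate against your extractor

def gated_upsert(facts: dict, key: str, value: str, confidence: float) -> bool:
    """Mitigation 2: overwrite an existing key only when confidence clears the bar."""
    existing = facts.get(key)
    conflicts = existing is not None and existing["value"] != value
    if conflicts and confidence < OVERWRITE_THRESHOLD:
        return False                       # keep the old value; reject the write
    facts[key] = {"value": value, "confidence": confidence}
    return True

def append_both(episodes: list, key: str, value: str, ts: float) -> None:
    """Mitigation 1: never overwrite; keep every claim and let ranking sort it out."""
    episodes.append({"key": key, "value": value, "ts": ts})

facts: dict = {}
gated_upsert(facts, "diet", "vegetarian", confidence=0.95)
accepted = gated_upsert(facts, "diet", "eats steak", confidence=0.40)

timeline: list = []
append_both(timeline, "diet", "vegetarian", ts=100.0)
append_both(timeline, "diet", "ordered steak", ts=200.0)
```

The low-confidence contradiction is rejected by the gated store, while the episodic timeline keeps both claims so the model can see the conflict at read time.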

The cosine-only retrieval regression. Shipping cosine-only retrieval and calling it “memory” yields a system that retrieves textually-similar episodes regardless of recency or importance. The agent will surface a 4-month-old preference that has since been updated, or a one-off observation that happens to share keywords with the query. Always pair similarity with recency and importance, even with naive weights — equal-weight α=β=γ=1 already outperforms cosine-only in every published benchmark.

The episode-boundary mismatch. Storing per-message episodes when the use case wants session-level recall, or vice versa, produces a system where retrievals are technically correct but consistently miss what the user means. A debugging signal: if the agent finds the right episode but it’s missing the surrounding context to be useful, your boundary is too fine; if the agent finds the right session but can’t pinpoint the right moment, your boundary is too coarse. Hierarchical granularity (per-message + per-session, retrieved in tandem) is the production-grade fix.

The unbounded growth trap. Storage is cheap; retrieval over noisy storage is expensive. An episodic store with no compaction, no consolidation, no deletion policy will work fine for a year and then become the slowest, lowest-precision component of the agent. The maintenance layer — reflection, summarization and context compression, conflict resolution, deletion — is what keeps the store useful over time, and it’s the layer most often missing from MVP implementations. Future articles in this subtree work each remaining piece; the rule of thumb is budget for maintenance from the start, not as a v2 feature.

The “feels great in demos, worse in production” gotcha. Demos are short: five turns, one session, a fresh store. Production is long: months of sessions per user, contradictions, edge cases, stale facts. The performance gap is real, and it’s why every published memory framework has a benchmark story to tell. Don’t ship until you’ve evaluated against a multi-session benchmark; the workhorse public ones are LongMemEval (Wu et al., 2024) (500 questions across 5 memory abilities, ICLR 2025) and LoCoMo (Maharana et al., 2024) (50 conversations × ~600 turns, with multi-hop and temporal categories). The memory evaluation article works them in detail with the precision/recall harness and the per-category breakdown patterns; today’s takeaway is to know they exist before shipping.

Further reading

  • Memory Write Policies: What’s Worth Remembering — the deep dive on the write path this article only sketched. The four-stage pipeline (triage, extract, dedupe, persist), the journal-and-checkpoint pattern, the hot-path-vs-deferred-vs-background trade-off, and a side-by-side of the policies Mem0, Letta, LangMem, A-MEM, and MemoryBank each ship.
  • Episode Segmentation and Salience Scoring — the upstream pair of operations the write policy depends on. Where one episode ends and the next begins (five segmentation signals from fixed windows up to agent-emitted markers), and how much the closed segment should weigh on the read-side rerank (the anchored 1-10 importance prompt).
  • Production Memory Frameworks: MemGPT/Letta, mem0, Zep, Graphiti — the four-way framework comparison the three stances in this article (Mem0, Letta, LangGraph) get worked out against. The matrix, the integration patterns, and the build-versus-buy decision turn the framework sketch here into a usable decision.
  • Memory Retrieval Policies: Recency, Relevance, Importance — the deep dive on the read path this article sketched. Per-candidate signal normalization, the LRU/LFU/ARC cache-replacement parallel, read-driven last_read and retrieval-count updates, the Ebbinghaus-curve-shaped MemoryBank strengthening, and the workload-specific weight-tuning patterns that take the equal-weights baseline the rest of the way.