
Production Memory Frameworks: MemGPT/Letta, mem0, Zep, Graphiti

MemGPT/Letta, mem0, Zep, and Graphiti compared on architecture, write/read paths, benchmarks, and the build-versus-buy decision for production memory.

Jatin Bansal

A team has shipped twenty memory primitives and now has to pick what actually runs in production. Vector store, episodic log, bi-temporal graph, reflection job, conflict resolver, multi-tenant scoping, eval harness: all twenty are real pieces of code with real maintenance budgets. The question every team eventually asks is the unglamorous one: do we keep building these ourselves, or adopt a framework that has converged on most of them? The four frameworks that have absorbed most of the design space (Letta, the MemGPT productization; mem0; Zep; and Graphiti) each ship a different opinion about which primitives are load-bearing. Picking the right one is downstream of understanding what each is optimized for; the wrong choice locks you into a write pipeline you cannot easily migrate off. This piece is the capstone of the memory subtree: the comparison matrix, the build-versus-buy decision, the integration patterns, and the failure modes each framework still has in mid-2026.

Opening bridge

Yesterday’s piece on memory evaluation closed the measurement axis: LoCoMo, LongMemEval, BEAM, MemoryAgentBench, the per-category breakdown, the precision/recall harness, the contradiction-resolution category where every framework still scores under 6%. The eighteen articles before it walked the mechanics — write policies, retrieval policies, hierarchical memory, knowledge graphs, conflict and forgetting, sleep-time compute. Every one of those articles flagged the same forward reference: the production frameworks article will work the comparison matrix. Today’s piece is that matrix. The memory subtree closes here; the next subtree (Evaluation) opens tomorrow with eval-driven development — which builds on the same eval discipline this article uses to compare frameworks.

Definition

A production memory framework is a runtime that bundles the write pipeline, storage substrate(s), read pipeline, multi-tenancy primitives, and maintenance passes into a single SDK. A substrate (pgvector, Qdrant) is unopinionated; a framework picks an episode shape, a write gate, a retrieval blend, a tier policy, and a tenant model, then exposes them as a coherent add/search/update/delete API. Adopting a framework is buying its opinions.
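To make the "coherent add/search/update/delete API" concrete, here is a minimal sketch of that surface as a Python Protocol with a toy in-memory implementation. Every name here (MemoryFramework, ToyMemory, tenant_id) is hypothetical and does not mirror any of the four SDKs; the substring match only stands in for whatever retrieval blend a real framework ships.

```python
from dataclasses import dataclass
from itertools import count
from typing import Protocol


@dataclass
class MemoryRecord:
    id: int
    tenant_id: str
    text: str


class MemoryFramework(Protocol):
    """Hypothetical surface; real SDKs differ in names and arguments."""

    def add(self, tenant_id: str, text: str) -> int: ...
    def search(self, tenant_id: str, query: str, limit: int = 5) -> list[MemoryRecord]: ...
    def update(self, tenant_id: str, record_id: int, text: str) -> None: ...
    def delete(self, tenant_id: str, record_id: int) -> None: ...


class ToyMemory:
    """In-memory stand-in: every operation is scoped by tenant_id."""

    def __init__(self) -> None:
        self._records: dict[int, MemoryRecord] = {}
        self._ids = count(1)

    def add(self, tenant_id: str, text: str) -> int:
        record_id = next(self._ids)
        self._records[record_id] = MemoryRecord(record_id, tenant_id, text)
        return record_id

    def search(self, tenant_id: str, query: str, limit: int = 5) -> list[MemoryRecord]:
        # Naive case-insensitive substring match, standing in for a real retrieval blend.
        hits = [
            r for r in self._records.values()
            if r.tenant_id == tenant_id and query.lower() in r.text.lower()
        ]
        return hits[:limit]

    def _owned(self, tenant_id: str, record_id: int) -> MemoryRecord:
        # Reject access to records that belong to another tenant.
        record = self._records.get(record_id)
        if record is None or record.tenant_id != tenant_id:
            raise KeyError(f"record {record_id} not found for tenant {tenant_id}")
        return record

    def update(self, tenant_id: str, record_id: int, text: str) -> None:
        self._owned(tenant_id, record_id).text = text

    def delete(self, tenant_id: str, record_id: int) -> None:
        del self._records[self._owned(tenant_id, record_id).id]
```

The point of the sketch is the tenant argument on every call: a framework's tenant model is part of the API shape, not something bolted on afterwards.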

Four define the field in 2026. MemGPT/Letta is the productized version of the original MemGPT paper — three-tier hierarchical memory (core/recall/archival) with the agent self-managing tier promotion via tool calls. mem0 is the distill-at-write vector layer with an optional graph extension (Mem0g); the LLM-gated fact-extraction pipeline runs on every add. Zep is the graph-first hybrid: a bi-temporal knowledge graph wraps vector and BM25 indexes, all retrieval fused, no LLM in the read path. Graphiti is Zep’s open-source temporal-graph engine, usable standalone when you want the bi-temporal substrate without the cloud product on top.
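The "all retrieval fused" step can be illustrated with reciprocal rank fusion (RRF), which the comparison matrix in this piece names for Zep and Graphiti. The sketch below is generic RRF, not Zep's implementation: the constant k = 60 is the commonly used default and an assumption here, and the three hit lists are made-up IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from three retrievers over the same edge IDs.
graph_hits = ["e1", "e2", "e3"]
vector_hits = ["e2", "e1", "e4"]
bm25_hits = ["e2", "e5"]

fused = reciprocal_rank_fusion([graph_hits, vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])
```

Because RRF works on ranks rather than raw scores, it needs no score normalization across the graph, vector, and BM25 retrievers, and it involves no model call.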

Intuition

The mental model that compresses the four-way comparison: each framework optimizes a different point on the write-cost-versus-read-cost curve, and the right choice is determined by which side of that curve your workload’s hot path sits on.

  • Letta pays the least at write time and lets the agent decide what to promote. The framework gives you the tier topology and the tool surface (for example core_memory_append and archival_memory_search; exact tool names vary by Letta version); the policy lives in the prompt.
  • mem0 pays heavily at write time (one or two LLM calls per turn to extract facts) so reads stay vector-only and cheap. Mem0g adds a parallel graph write for relational queries; the read still stays LLM-free.
  • Zep pays the most at write time — entity extraction, relation extraction, deduplication, bi-temporal stamping, contradiction detection — and runs a pure-traversal read. All the model work happens during ingest.
  • Graphiti is the substrate Zep is built on, exposed standalone. Same bi-temporal graph, same write cost, no cloud product around it.

The right framework is the one whose write/read asymmetry matches your workload’s traffic pattern. A high-write, low-read background job benefits least from mem0/Zep’s write-heavy designs; a chatbot with 10 reads per write benefits most.
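A back-of-envelope model makes the asymmetry argument checkable. The sketch below compares two hypothetical cost profiles in arbitrary units; the numbers are invented for illustration and are not measurements of any of the four frameworks, and the "defer-to-read" profile is a generic contrast case, not one of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostProfile:
    """Illustrative per-operation costs in arbitrary units (not measured values)."""
    name: str
    write_cost: float
    read_cost: float


def cost_per_write_cycle(profile: CostProfile, reads_per_write: float) -> float:
    """Total cost of one write plus the reads that accompany it."""
    return profile.write_cost + reads_per_write * profile.read_cost


def break_even_reads(heavy: CostProfile, light: CostProfile) -> float:
    """Reads per write above which the write-heavy design becomes cheaper."""
    if heavy.read_cost >= light.read_cost:
        raise ValueError("the write-heavy profile must have the cheaper read")
    return (heavy.write_cost - light.write_cost) / (light.read_cost - heavy.read_cost)


# Made-up numbers: one design distills at write time, the other defers work to read time.
distill_at_write = CostProfile("distill-at-write", write_cost=10.0, read_cost=1.0)
defer_to_read = CostProfile("defer-to-read", write_cost=1.0, read_cost=4.0)

for ratio in (0.1, 10.0):
    heavy = cost_per_write_cycle(distill_at_write, ratio)
    light = cost_per_write_cycle(defer_to_read, ratio)
    print(f"reads/write={ratio}: heavy={heavy:.1f} light={light:.1f}")
```

With these placeholder costs the crossover sits at three reads per write; plugging in your own measured write and read costs gives the ratio at which a write-heavy framework starts paying for itself.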

The distributed-systems parallel

The four frameworks line up cleanly against four database-design archetypes. Letta is an in-memory store with paging to disk — the agent’s context is the hot tier, recall and archival are disk, tool calls are page faults; see the hierarchical memory article for the deep dive. mem0 is a denormalized read-optimized store with an LLM-driven ETL on the write path — facts as materialized view; the OLAP read-optimized cube pattern. Zep is a graph database with vector and BM25 secondary indexes — graph as source of truth, auxiliary indexes as fuzzy fallback; the bi-temporal columns (valid_time and transaction_time) are the audit-heavy transactional database pattern lifted directly. Graphiti is the embedded engine version of the same graph database, the way SQLite is to Postgres — same data model, hosted yourself, fewer batteries included.

The disanalogy: database systems have decades of vendor stability; these frameworks ship breaking changes monthly. Treat the API contracts in this article as approximately right for mid-2026; read the current docs before integration.

The comparison matrix

| Dimension | Letta (MemGPT) | mem0 | Zep | Graphiti |
| --- | --- | --- | --- | --- |
| Primary substrate | Hierarchical (core/recall/archival) | Vector + optional graph | Graph + vector + BM25 | Bi-temporal graph |
| Write path cost | Low (DB write + tool call) | High (LLM fact extraction per turn) | Very high (entity + relation extraction + bi-temporal stamping) | Very high (same as Zep) |
| Read path cost | Low (tool calls, no LLM in core) | Low (vector search + optional graph traversal) | Low (pure traversal + RRF, no LLM) | Low (pure traversal + RRF) |
| Bi-temporal | No (single transaction clock) | No | Yes (valid + transaction time) | Yes |
| Self-managed by agent | Yes (agent calls tier-promotion tools) | No (harness-driven) | No (harness-driven) | No |
| Multi-tenancy | Per-agent state (built-in) | user_id required parameter | user_id / session_id (built-in) | group_id namespace |
| Hosted option | Letta Cloud + self-host | mem0 Cloud + open-source | Zep Cloud + open-source community edition | Self-host only |
| 2026 benchmark anchor | ~83% LongMemEval (community report) | 94.4% LongMemEval (token-efficient algorithm, mem0.ai/research) | 71.2% LongMemEval (Zep paper) | Same substrate as Zep |
| Best-fit workload | Long-running stateful agents where the agent itself manages context | Chatbots and assistants with high user-fact density | CRM, compliance, healthcare (relational + temporal queries dominate) | Greenfield graph-first builds, self-hosted |
| Worst-fit workload | Stateless or short-lived agents | Workloads where raw episodes matter more than distilled facts | Workloads with no relational structure | Same as Zep, plus teams who want a managed product |

The benchmark-anchor row is the most volatile entry in the table. Mem0’s LoCoMo score went from 66.9% in 2025 to 92.5% in 2026: partly genuine algorithm improvement, partly protocol stabilization, partly hill-climbing. Read the protocol (judge model, ingest pipeline, top-K) before comparing across columns. The memory evaluation article is the deep dive on why direct cross-framework comparison is harder than it looks.
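One way to enforce "read the protocol first" in an eval harness is to carry the protocol with every score and refuse to subtract scores whose protocols differ. This is a minimal sketch: the class and field names are hypothetical, and the scores and protocol values are placeholders, not published results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalProtocol:
    benchmark: str
    judge_model: str
    ingest_pipeline: str
    top_k: int


@dataclass(frozen=True)
class BenchmarkResult:
    framework: str
    score: float
    protocol: EvalProtocol


def score_delta(a: BenchmarkResult, b: BenchmarkResult) -> float:
    """Return a.score - b.score, but only when both runs share one protocol."""
    if a.protocol != b.protocol:
        raise ValueError(
            f"protocols differ for {a.framework} and {b.framework}; scores are not rankable"
        )
    return a.score - b.score


# Placeholder protocols and scores, for illustration only.
protocol_x = EvalProtocol("LongMemEval", "judge-a", "pipeline-1", top_k=10)
protocol_y = EvalProtocol("LongMemEval", "judge-b", "pipeline-2", top_k=20)

v1 = BenchmarkResult("framework-a-v1", 0.70, protocol_x)
v2 = BenchmarkResult("framework-a-v2", 0.75, protocol_x)
other = BenchmarkResult("framework-b", 0.80, protocol_y)

print(round(score_delta(v2, v1), 2))  # same protocol: a within-framework delta
```

Calling score_delta(other, v2) raises instead of returning a number, which is the behavior you want when someone pastes a vendor figure next to your own run.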

Build versus buy

Reach for a framework when (a) your team has fewer than two engineers who can own a memory subsystem long-term, (b) your workload fits within ±20% of one of the four frameworks’ opinions, and (c) you don’t have an existing storage layer the framework would fight. Hand-roll when (a) you have those engineers, (b) your defining write or read pattern isn’t covered (per-document write policies for a legal-research agent, custom segmentation for a code-review agent, or a graph schema that doesn’t fit Graphiti’s entity-relation model), or (c) you already operate a vector store and a graph store and the framework’s opinions about both fight your data model.

The most common mistake: adopting a framework, then writing so much code around it to make it fit that you would have been better off rolling your own. mem0 and Letta both have escape hatches — custom prompts, overrideable extraction, custom tools — but every escape hatch is a place the next breaking change will land. If you find yourself overriding more than two defaults, the framework is wrong for your workload.

The escape from the binary: roll your own on a store primitive and adopt a framework only for the layer where its opinions are load-bearing. LangGraph stores give you a tuple-namespaced KV/vector store; you build the write policy, retrieval blend, and tier topology yourself, and adopt Graphiti only for the graph layer if your workload needs bi-temporal queries. Most teams converge on this after a year.
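The "store primitive" half of that approach is small. Below is a toy tuple-namespaced key-value store in plain Python; it is a stand-in written for this example, not the LangGraph store API, and the class and method names are assumptions. The write policy, retrieval blend, and tier topology would be your own code layered on top.

```python
from collections import defaultdict
from typing import Optional

Namespace = tuple[str, ...]


class NamespacedStore:
    """Toy tuple-namespaced KV store (a stand-in, not the LangGraph API)."""

    def __init__(self) -> None:
        self._data: dict[Namespace, dict[str, dict]] = defaultdict(dict)

    def put(self, namespace: Namespace, key: str, value: dict) -> None:
        self._data[namespace][key] = value

    def get(self, namespace: Namespace, key: str) -> Optional[dict]:
        # Use .get on the outer dict so a lookup never creates an empty namespace.
        return self._data.get(namespace, {}).get(key)

    def search(self, prefix: Namespace) -> list[tuple[Namespace, str, dict]]:
        """Return every item whose namespace starts with the given prefix."""
        return [
            (ns, key, value)
            for ns, items in self._data.items()
            if ns[: len(prefix)] == prefix
            for key, value in items.items()
        ]


store = NamespacedStore()
store.put(("acme", "priya", "facts"), "order", {"text": "Order #4321 has not shipped"})
store.put(("acme", "devansh", "facts"), "role", {"text": "Devansh is a manager"})
store.put(("globex", "sam", "facts"), "plan", {"text": "Sam is on the free plan"})

# Tenant isolation falls out of the namespace prefix.
print(len(store.search(("acme",))))
```

Keeping the tenant as the first tuple element means every read is scoped by construction, which is the property you would otherwise be buying from a framework's tenant model.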

Integration pattern 1: Letta (Python)

The Letta integration pattern: the framework owns the agent state, and the application owns the message routing. You call the SDK with messages; Letta manages the memory blocks, the recall/archival tiers, and the persistence behind a client.agents.create / client.agents.messages.create surface. Install with pip install letta-client, then either run a local server (docker run -d -p 8283:8283 letta/letta:latest) or use Letta Cloud.

python
# pip install letta-client
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# Create an agent with hierarchical memory: core memory blocks are always in context,
# recall and archival tiers live outside the prompt and the agent pages them in via tools.
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {"label": "persona", "value": "You are a customer-support agent for Acme.",
         "limit": 1000},
        {"label": "human", "value": "Unknown user.", "limit": 2000},
    ],
    # Base memory tools (e.g. core_memory_append, conversation_search, archival_memory_search)
    # are attached by default; exact names vary by Letta version.
    tools=["web_search"],
)

# Each call returns the full step trace; Letta has already updated memory blocks,
# inserted into recall/archival, and persisted state. No explicit write call needed.
reply = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "I'm Priya, my order #4321 hasn't shipped."}],
)
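assistant_msg = next(
    (m for m in reversed(reply.messages) if m.message_type == "assistant_message"),
    None,  # a step can end on a tool call without an assistant message
)
print(assistant_msg.content if assistant_msg else "(no assistant message returned)")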

# Inspect the updated memory blocks — the agent should have written "Priya" and "#4321"
# into the human block via core_memory_append on its own.
state = client.agents.retrieve(agent_id=agent.id)
for block in state.memory.blocks:
    print(block.label, "→", block.value)

The key property: there is no explicit memory.add() call. The agent decides what to remember, and the framework records what it decided. This is Letta’s central opinion — agent-driven memory management — and it is either exactly right (long-running stateful agents that learn from their interactions) or exactly wrong (workflows where the harness, not the agent, owns the write policy).

Integration pattern 2: mem0 (TypeScript)

The mem0 integration pattern is the inverse: the application owns the message routing, mem0 owns the memory write/read on every turn. You call memory.add(messages, { userId }) after each user turn and memory.search(query, { userId }) before each model call. The framework extracts facts during add and serves the relevant subset during search. Install: npm install mem0ai (open-source mode) or use Mem0 Cloud.

typescript
// npm install mem0ai
import { Memory } from "mem0ai/oss";

// Vector store + optional graph (Mem0g). Set OPENAI_API_KEY in the environment.
const memory = new Memory({
  vectorStore: { provider: "qdrant", config: { collectionName: "agent_memory", host: "localhost", port: 6333 } },
  // Uncomment to enable Mem0g — adds graph store for relational queries.
  // graphStore: { provider: "neo4j", config: { url: "bolt://localhost:7687", username: "neo4j", password: "password" } },
});

// Write: pass the conversation; mem0 runs LLM fact extraction internally
// and stores the distilled claims, not the raw turns.
await memory.add(
  [
    { role: "user", content: "I'm Priya. Order #4321 hasn't shipped yet." },
    { role: "assistant", content: "Hi Priya, let me check on order #4321." },
  ],
  { userId: "user_priya_123" },
);

// Read: before the next model call, pull the relevant memories for the user.
// search() returns the distilled facts mem0 chose to store.
const results = await memory.search("What's the user's name and pending order?", {
  userId: "user_priya_123",
  limit: 5,
});
for (const r of results.results) {
  // score can be absent on some results, so guard before formatting.
  console.log(`[score=${r.score?.toFixed(2) ?? "n/a"}]`, r.memory);
}

// Inject results into the next system prompt. The harness, not the framework,
// owns this rendering step — mem0 returns the facts; you decide how to use them.
const memoryContext = results.results.map((r) => `- ${r.memory}`).join("\n");
const systemPrompt = `## Known facts about the user:\n${memoryContext}\n\nReply to the user.`;
// ... pass systemPrompt to your LLM SDK of choice.

The opinion this framework ships: the unit of long-term memory is the distilled fact, not the raw turn. If you want the raw turns preserved verbatim, mem0 fights you — that’s not what the framework optimizes for. The escape hatch is memory.add with infer: false, which skips extraction and stores the raw text, but that path is not what the LoCoMo and LongMemEval numbers in the marketing are measured on.

Integration pattern 3: Graphiti (Python sketch)

Graphiti’s contract: give me episodes with timestamps, and I’ll give you a temporally correct knowledge graph and a fused vector+BM25+graph retriever. Install with pip install graphiti-core and run Neo4j (docker run -d -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5).

python
# pip install graphiti-core
# Graphiti uses OpenAI for extraction and embeddings by default; set OPENAI_API_KEY.
import asyncio
from datetime import datetime, timezone
from graphiti_core import Graphiti


async def main():
    graphiti = Graphiti(
        uri="bolt://localhost:7687", user="neo4j", password="password",
    )
    try:
        await graphiti.build_indices_and_constraints()

        # add_episode triggers entity extraction, relation extraction, dedup,
        # bi-temporal stamping, and contradiction detection in one call.
        # reference_time is timezone-aware so interval comparisons stay consistent.
        await graphiti.add_episode(
            name="march_check_in",
            episode_body="Priya is my manager. She approved the Q1 forecast on March 15.",
            source_description="user message",
            reference_time=datetime(2026, 3, 15, tzinfo=timezone.utc),
            group_id="tenant_acme",  # multi-tenant scope
        )
        await graphiti.add_episode(
            name="april_update",
            episode_body="Devansh is my new manager starting April 15. Priya moved to a different team.",
            source_description="user message",
            reference_time=datetime(2026, 4, 15, tzinfo=timezone.utc),
            group_id="tenant_acme",
        )

        # The read path: search() is hybrid (graph traversal + vector + BM25 fused with RRF),
        # no LLM call in the loop. Filter by group_id for tenant isolation.
        results = await graphiti.search(query="Who manages this user as of March 20?", group_ids=["tenant_acme"])
        for edge in results:
            # Each edge carries its validity interval; use it to resolve the as-of question.
            print(edge.fact, "valid:", edge.valid_at, "->", edge.invalid_at)
    finally:
        # Release the Neo4j driver connection.
        await graphiti.close()


asyncio.run(main())

The bi-temporal property is load-bearing: every returned edge carries its validity interval (valid_at, invalid_at), so an “as of March 20” question resolves to Priya, whose manager edge was valid on that date, and not to Devansh, the current manager. A vector-only store that keeps only the latest fact has no interval to check and cannot answer that question correctly, however its retrieval is scored. If your workload doesn’t include point-in-time queries, this property is wasted complexity; if it does, no other framework ships it as a first-class concept.
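The interval check behind a point-in-time answer is short enough to show in isolation. The sketch below applies it on the client side to a list of edges; the Edge class is a hypothetical stand-in with the same valid_at / invalid_at fields printed in the example above, and whether a given Graphiti version applies such a filter inside search is something to confirm against its current docs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """Hypothetical stand-in for a fact edge with a validity interval."""
    fact: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means the fact is still valid


def valid_as_of(edges: list[Edge], when: datetime) -> list[Edge]:
    """Keep edges whose half-open interval [valid_at, invalid_at) contains `when`."""
    return [
        e for e in edges
        if e.valid_at <= when and (e.invalid_at is None or when < e.invalid_at)
    ]


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


edges = [
    Edge("Priya manages the user", utc(2026, 3, 15), utc(2026, 4, 15)),
    Edge("Devansh manages the user", utc(2026, 4, 15)),
]

print([e.fact for e in valid_as_of(edges, utc(2026, 3, 20))])
```

The half-open interval is a deliberate choice: on April 15 itself exactly one manager edge is valid, so a handover date never yields two answers or none.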

Trade-offs, failure modes, gotchas

Letta — the over-pinned-core failure mode. An agent with no demotion discipline grows its core blocks until the prompt becomes attention-thin (the lost-in-the-middle effect) or hits the context limit. Per-block limit is a structural mitigation; the looser failure mode is the agent that pins everything because it doesn’t know what to demote. The hierarchical memory article covers this.
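A harness-side version of that per-block limit can be sketched as follows. This is not Letta's mechanism: CoreBlock, the character-based limit, and the oldest-first demotion rule are all assumptions chosen to show the shape of a demotion discipline.

```python
from dataclasses import dataclass, field


@dataclass
class CoreBlock:
    label: str
    limit: int  # maximum total characters kept in the prompt
    entries: list[str] = field(default_factory=list)

    def size(self) -> int:
        return sum(len(entry) for entry in self.entries)


def append_with_demotion(block: CoreBlock, entry: str, archival: list[str]) -> None:
    """Append to the core block, then demote the oldest entries until it fits."""
    block.entries.append(entry)
    # Stop at one entry: a single oversized entry stays pinned rather than vanishing.
    while block.size() > block.limit and len(block.entries) > 1:
        archival.append(block.entries.pop(0))


human = CoreBlock(label="human", limit=20)
archive: list[str] = []
for note in ("aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc"):
    append_with_demotion(human, note, archive)

print(human.entries, archive)
```

Oldest-first is the crudest possible policy; the harder problem the paragraph describes, knowing which entry deserves demotion, would replace pop(0) with a relevance or recency score.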

mem0 — the fact-extraction-bias failure mode. The write-time LLM call is what makes the read cheap; it is also where information the conversation only made implicit gets lost. Sarcasm, conditionals (“if it rains, I’ll skip the meeting”), and multi-turn negotiations all flatten badly through fact extraction. Fix: layer mem0 over a raw episode log — mem0 for facts, your own table for raw turns — and read from both depending on query type.
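The layering fix can be sketched without any framework at all. In the example below, LayeredMemory is a hypothetical wrapper: the facts list stands in for whatever the fact-extraction layer returns, the raw list is your own episode log, and keyword matching stands in for real retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredMemory:
    """Distilled facts plus a verbatim turn log, kept side by side per user."""
    facts: dict[str, list[str]] = field(default_factory=dict)
    raw_turns: dict[str, list[str]] = field(default_factory=dict)

    def write(self, user_id: str, turn: str, extracted_facts: list[str]) -> None:
        # The raw turn is always kept; the facts come from whatever extractor you use.
        self.raw_turns.setdefault(user_id, []).append(turn)
        self.facts.setdefault(user_id, []).extend(extracted_facts)

    def read(self, user_id: str, query: str, verbatim: bool = False) -> list[str]:
        # Route by query type: verbatim questions go to the raw log, the rest to facts.
        source = self.raw_turns if verbatim else self.facts
        terms = query.lower().split()
        return [
            item for item in source.get(user_id, [])
            if any(term in item.lower() for term in terms)
        ]


memory = LayeredMemory()
memory.write(
    "user_priya_123",
    "If it rains, I'll skip the meeting.",
    ["User may skip the meeting"],  # the conditional was flattened by extraction
)

print(memory.read("user_priya_123", "meeting"))
print(memory.read("user_priya_123", "rains", verbatim=True))
```

The conditional survives only in the raw log, which is the case the paragraph warns about: the fact layer answers "what do we know", the raw layer answers "what was actually said".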

Zep / Graphiti — entity-extraction cost. The write path is dominated by small-model calls for entity and relation extraction; at scale that’s one to three calls per episode. Mitigations are the standard expensive-write playbook: batch when latency tolerates, skip extraction for low-value episodes, budget explicitly. The write policies article is the triage-stage deep dive.
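The three mitigations (skip, batch, budget) compose into a small planning step that runs before any extraction call. This sketch is framework-independent; the low-value phrase list, the length threshold, and the idea that one batch maps to one extraction call are all assumptions for illustration.

```python
from typing import Iterator

# Hypothetical low-value turns; a real triage stage would likely use a classifier.
LOW_VALUE = {"ok", "thanks", "thank you", "got it", "hi", "hello"}


def worth_extracting(episode: str, min_chars: int = 15) -> bool:
    """Skip greetings, acknowledgements, and very short turns."""
    text = episode.strip().lower().rstrip(".!?")
    return text not in LOW_VALUE and len(text) >= min_chars


def batches(items: list[str], size: int) -> Iterator[list[str]]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_extraction(
    episodes: list[str], batch_size: int = 2, max_calls: int = 10
) -> tuple[list[list[str]], list[str]]:
    """Return (batches to extract now, episodes deferred because the budget ran out)."""
    kept = [e for e in episodes if worth_extracting(e)]
    all_batches = list(batches(kept, batch_size))
    planned = all_batches[:max_calls]
    deferred = [e for batch in all_batches[max_calls:] for e in batch]
    return planned, deferred


episodes = [
    "Thanks!",
    "Priya is my manager and approved the Q1 forecast.",
    "ok",
    "Devansh is my new manager starting April 15.",
    "My order #4321 has not shipped yet.",
]
planned, deferred = plan_extraction(episodes, batch_size=2, max_calls=1)
print(len(planned), len(deferred))
```

Returning the deferred episodes instead of dropping them keeps the budget explicit: they can be queued for an idle period rather than silently lost.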

Benchmark numbers are non-comparable across protocols. Mem0’s 94.4% LongMemEval and Zep’s 71.2% LongMemEval used different judges, prompts, and ingest pipelines. Treat the numbers as within-framework deltas, not cross-framework rankings. The evaluation article works through the LLM-as-judge bias and protocol drift.

Lock-in cost of distillation. mem0’s distill-at-write locks you into its extraction logic — migration means replaying conversations through the new framework’s extractor (expensive) or losing accumulated facts (lossy). Letta’s recall tier and Graphiti’s episode log both preserve raw episodes alongside derived structure, giving a cleaner exit path.

Conflict resolution is uniformly weak. MemoryAgentBench’s multi-hop contradiction resolution stays under ~6% across all four. None ship a robust solution; the conflict-and-forgetting article covers the supersession-versus-deletion patterns you layer on top. Bi-temporal substrates (Zep/Graphiti) at least give you the data model.
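The supersession pattern, as opposed to deletion, is simple to state in code. The sketch below is a minimal assumption-laden version: Fact and supersede are hypothetical names, and matching on an exact (subject, predicate) pair sidesteps the hard part, which is detecting that two differently worded facts contradict each other.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means currently valid


def supersede(facts: list[Fact], new: Fact) -> None:
    """Close out open facts with the same subject and predicate, then append the new one."""
    for fact in facts:
        same_slot = (fact.subject, fact.predicate) == (new.subject, new.predicate)
        if same_slot and fact.invalid_at is None:
            fact.invalid_at = new.valid_at  # keep the old fact, mark when it stopped holding
    facts.append(new)


def current(facts: list[Fact]) -> list[Fact]:
    return [f for f in facts if f.invalid_at is None]


history = [
    Fact("user", "manager", "Priya", datetime(2026, 3, 15, tzinfo=timezone.utc)),
]
supersede(
    history,
    Fact("user", "manager", "Devansh", datetime(2026, 4, 15, tzinfo=timezone.utc)),
)

print([f.value for f in current(history)], len(history))
```

Nothing is deleted: the superseded fact stays queryable with a closed interval, which is what a bi-temporal substrate gives you as a data model and what a plain vector store requires you to build yourself.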

Sleep-time-compute compatibility is uneven. Letta ships sleep-time agents as a first-class concept; mem0 and Zep support background consolidation but the pattern is more DIY. The sleep-time-compute article covers when this matters — and for high-throughput multi-tenant workloads where idle time is scarce, none of the frameworks ship a perfect answer.

Further reading

  • Memory Evaluation: Benchmarks and Custom Evals — the measurement layer that calibrates every framework comparison in this piece. Before adopting a framework on the strength of its published numbers, run the protocols there against your workload.
  • Hierarchical Memory: Working / Episodic / Semantic Tiers — the architecture that the MemGPT/Letta side of the matrix instantiates. The OS-paging model, the core/recall/archival tier definitions, and the promotion/demotion policies are the substrate Letta productizes.
  • Knowledge Graphs as Structured Memory — the architecture that the Zep/Graphiti side of the matrix instantiates. The bi-temporal model, the hybrid graph+vector retrieval, and the entity-extraction write path covered there are what Graphiti operationalizes.
  • Memory Write Policies: What’s Worth Remembering — the layer where each framework’s design philosophy is most visible. mem0’s distill-at-write, Letta’s agent-driven append, Zep’s entity-extraction pipeline — all three are write-policy variants of the same four-stage pipeline that article covers.