Production Memory Frameworks: MemGPT/Letta, mem0, Zep, Graphiti
MemGPT/Letta, mem0, Zep, and Graphiti compared on architecture, write/read paths, benchmarks, and the build-versus-buy decision for production memory.
A team has shipped twenty memory primitives and now has to pick what actually runs in production. Vector store, episodic log, bi-temporal graph, reflection job, conflict resolver, multi-tenant scoping, eval harness: all twenty are real pieces of code with real maintenance budgets. The question every team eventually asks is the unglamorous one: do we keep building these ourselves, or adopt a framework that has converged on most of them? The four frameworks that have absorbed most of the design space (Letta, the MemGPT productization; mem0; Zep; and Graphiti) each ship a different opinion about which primitives are load-bearing. Picking the right one is downstream of understanding what each is optimized for; the wrong choice locks you into a write pipeline you cannot easily migrate off. This piece is the capstone of the memory subtree: the comparison matrix, the build-versus-buy decision, the integration patterns, and the failure modes each framework still has in mid-2026.
Opening bridge
Yesterday’s piece on memory evaluation closed the measurement axis: LoCoMo, LongMemEval, BEAM, MemoryAgentBench, the per-category breakdown, the precision/recall harness, the contradiction-resolution category where every framework still scores under 6%. The eighteen articles before it walked the mechanics — write policies, retrieval policies, hierarchical memory, knowledge graphs, conflict and forgetting, sleep-time compute. Every one of those articles flagged the same forward reference: the production frameworks article will work the comparison matrix. Today’s piece is that matrix. The memory subtree closes here; the next subtree (Evaluation) opens tomorrow with eval-driven development — which builds on the same eval discipline this article uses to compare frameworks.
Definition
A production memory framework is a runtime that bundles the write pipeline, storage substrate(s), read pipeline, multi-tenancy primitives, and maintenance passes into a single SDK. A substrate (pgvector, Qdrant) is unopinionated; a framework picks an episode shape, a write gate, a retrieval blend, a tier policy, and a tenant model, then exposes them as a coherent add/search/update/delete API. Adopting a framework is buying its opinions.
Four define the field in 2026. MemGPT/Letta is the productized version of the original MemGPT paper — three-tier hierarchical memory (core/recall/archival) with the agent self-managing tier promotion via tool calls. mem0 is the distill-at-write vector layer with an optional graph extension (Mem0g); the LLM-gated fact-extraction pipeline runs on every add. Zep is the graph-first hybrid: a bi-temporal knowledge graph wraps vector and BM25 indexes, all retrieval fused, no LLM in the read path. Graphiti is Zep’s open-source temporal-graph engine, usable standalone when you want the bi-temporal substrate without the cloud product on top.
Intuition
The mental model that compresses the four-way comparison: each framework optimizes a different point on the write-cost-versus-read-cost curve, and the right choice is determined by which side of that curve your workload’s hot path sits on.
- Letta pays the least at write time and lets the agent decide what to promote. The framework gives you the tier topology and the tool surface (core_memory_append, archival_search); the policy lives in the prompt.
- mem0 pays heavily at write time (one or two LLM calls per turn to extract facts) so reads stay vector-only and cheap. Mem0g adds a parallel graph write for relational queries; the read still stays LLM-free.
- Zep pays the most at write time — entity extraction, relation extraction, deduplication, bi-temporal stamping, contradiction detection — and runs a pure-traversal read. All the model work happens during ingest.
- Graphiti is the substrate Zep is built on, exposed standalone. Same bi-temporal graph, same write cost, no cloud product around it.
The right framework is the one whose write/read asymmetry matches your workload’s traffic pattern. A high-write, low-read background job benefits least from mem0/Zep’s write-heavy designs; a chatbot with 10 reads per write benefits most.
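The asymmetry can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the per-operation costs are invented placeholder units, not measurements of any framework, and the names `CostProfile`, `DISTILL_AT_WRITE`, and `STORE_RAW` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostProfile:
    """Per-operation cost in arbitrary units (assumed values, not measured)."""
    write: float
    read: float


def cost_per_write(profile: CostProfile, reads_per_write: float) -> float:
    """Cost of one write plus the reads that follow it."""
    return profile.write + reads_per_write * profile.read


def break_even_ratio(a: CostProfile, b: CostProfile) -> float:
    """Reads-per-write at which designs a and b cost the same."""
    return (a.write - b.write) / (b.read - a.read)


# Hypothetical profiles: one design does its work at ingest,
# the other defers work to query time.
DISTILL_AT_WRITE = CostProfile(write=10.0, read=1.0)
STORE_RAW = CostProfile(write=1.0, read=4.0)


def cheaper_design(reads_per_write: float) -> str:
    heavy = cost_per_write(DISTILL_AT_WRITE, reads_per_write)
    light = cost_per_write(STORE_RAW, reads_per_write)
    return "distill-at-write" if heavy < light else "store-raw"
```

With these assumed numbers the break-even sits at three reads per write; the point is the shape of the calculation, which you would rerun with costs measured on your own workload.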
The distributed-systems parallel
The four frameworks line up cleanly against four database-design archetypes. Letta is an in-memory store with paging to disk — the agent’s context is the hot tier, recall and archival are disk, tool calls are page faults; see the hierarchical memory article for the deep dive. mem0 is a denormalized read-optimized store with an LLM-driven ETL on the write path — facts as materialized view; the OLAP read-optimized cube pattern. Zep is a graph database with vector and BM25 secondary indexes — graph as source of truth, auxiliary indexes as fuzzy fallback; the bi-temporal columns (valid_time and transaction_time) are the audit-heavy transactional database pattern lifted directly. Graphiti is the embedded engine version of the same graph database, the way SQLite is to Postgres — same data model, hosted yourself, fewer batteries included.
The disanalogy: database systems have decades of vendor stability; these frameworks ship breaking changes monthly. Treat the API contracts in this article as approximately right for mid-2026; read the current docs before integration.
The comparison matrix
| Dimension | Letta (MemGPT) | mem0 | Zep | Graphiti |
|---|---|---|---|---|
| Primary substrate | Hierarchical (core/recall/archival) | Vector + optional graph | Graph + vector + BM25 | Bi-temporal graph |
| Write path cost | Low (DB write + tool call) | High (LLM fact extraction per turn) | Very high (entity + relation extraction + bi-temporal stamping) | Very high (same as Zep) |
| Read path cost | Low (tool calls, no LLM in core) | Low (vector search + optional graph traversal) | Low (pure traversal + RRF, no LLM) | Low (pure traversal + RRF) |
| Bi-temporal | No (single transaction clock) | No | Yes (valid + transaction time) | Yes |
| Self-managed by agent | Yes (agent calls tier-promotion tools) | No (harness-driven) | No (harness-driven) | No |
| Multi-tenancy | Per-agent state (built-in) | user_id required parameter | user_id / session_id (built-in) | group_id namespace |
| Hosted option | Letta Cloud + self-host | mem0 Cloud + open-source | Zep Cloud + open-source community edition | Self-host only |
| 2026 benchmark anchor | ~83% LongMemEval (community report) | 94.4% LongMemEval (token-efficient algorithm, mem0.ai/research) | 71.2% LongMemEval (Zep paper) | Same substrate as Zep |
| Best-fit workload | Long-running stateful agents where the agent itself manages context | Chatbots and assistants with high user-fact density | CRM, compliance, healthcare — relational + temporal queries dominate | Greenfield graph-first builds, self-hosted |
| Worst-fit workload | Stateless or short-lived agents | Workloads where raw episodes matter more than distilled facts | Workloads with no relational structure | Same as Zep, plus teams who want a managed product |
The numbers in the benchmark-anchor row are the most volatile entry in the table. Mem0’s LoCoMo score went from 66.9% in 2025 to 92.5% in 2026: partly genuine algorithm improvement, partly protocol stabilization, partly hill-climbing. Read the protocols (judge model, ingest pipeline, top-K) before comparing across columns. The memory evaluation article is the deep dive on why direct cross-framework comparison is harder than it looks.
Build versus buy
Reach for a framework when (a) your team has fewer than two engineers who can own a memory subsystem long-term, (b) your workload fits within ±20% of one of the four frameworks’ opinions, and (c) you don’t have an existing storage layer the framework would fight. Hand-roll when (a) you have those engineers, (b) your defining write or read pattern isn’t covered (per-document write policies for a legal-research agent, custom segmentation for a code-review agent, or a graph schema that doesn’t fit Graphiti’s entity-relation model), or (c) you already operate a vector store and a graph store and the framework’s opinions about both fight your data model.
The most common mistake: adopting a framework, then writing so much code around it to make it fit that you would have been better off rolling your own. mem0 and Letta both have escape hatches — custom prompts, overrideable extraction, custom tools — but every escape hatch is a place the next breaking change will land. If you find yourself overriding more than two defaults, the framework is wrong for your workload.
The escape from the binary: roll your own on a store primitive and adopt a framework only for the layer where its opinions are load-bearing. LangGraph stores give you a tuple-namespaced KV/vector store; you build the write policy, retrieval blend, and tier topology yourself, and adopt Graphiti only for the graph layer if your workload needs bi-temporal queries. Most teams converge on this after a year.
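A tuple-namespaced store is small enough to sketch directly. The code below is not LangGraph's API; `NamespacedStore` and its methods are hypothetical names, and the in-memory dict stands in for whatever backing store you actually run.

```python
from collections import defaultdict
from typing import Any, Optional

Namespace = tuple[str, ...]


class NamespacedStore:
    """Hypothetical in-memory KV store keyed by (namespace tuple, key)."""

    def __init__(self) -> None:
        self._data: dict[Namespace, dict[str, Any]] = defaultdict(dict)

    def put(self, namespace: Namespace, key: str, value: Any) -> None:
        self._data[namespace][key] = value

    def get(self, namespace: Namespace, key: str) -> Optional[Any]:
        return self._data.get(namespace, {}).get(key)

    def search(self, prefix: Namespace) -> list[tuple[Namespace, str, Any]]:
        """Return every item whose namespace starts with `prefix`."""
        return [
            (ns, key, value)
            for ns, items in self._data.items()
            if ns[: len(prefix)] == prefix
            for key, value in items.items()
        ]


store = NamespacedStore()
store.put(("tenant-a", "user-1", "facts"), "diet", "vegetarian")
store.put(("tenant-a", "user-2", "facts"), "diet", "vegan")
store.put(("tenant-b", "user-9", "facts"), "diet", "omnivore")

tenant_a_items = store.search(("tenant-a",))  # scoped to one tenant
```

Prefix search over the namespace tuple is what gives you tenant and user scoping for free; the write policy and retrieval blend remain yours to build on top.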
Integration pattern 1: Letta (Python)
The Letta integration pattern: the framework owns the agent state, and the application owns the message routing. You call the SDK with messages; Letta manages the memory blocks, the recall/archival tiers, and the persistence behind a client.agents.create / client.agents.messages.create surface. Install: pip install letta-client and run a local server (docker run -d -p 8283:8283 letta/letta:latest) or use Letta Cloud.
The key property: there is no explicit memory.add() call. The agent decides what to remember, and the framework records what it decided. This is Letta’s central opinion — agent-driven memory management — and it is either exactly right (long-running stateful agents that learn from their interactions) or exactly wrong (workflows where the harness, not the agent, owns the write policy).
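To make the agent-driven property concrete, here is a minimal simulation of the loop. It does not use the Letta SDK: the tool names core_memory_append and archival_search come from this article, while `AgentState`, `archival_insert`, `dispatch`, and the shape of the tool-call dicts are assumptions for illustration. A hard-coded list stands in for the model's output.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Hypothetical stand-in for framework-managed agent memory."""
    core: dict[str, str] = field(
        default_factory=lambda: {"human": "", "persona": ""}
    )
    archival: list[str] = field(default_factory=list)


def core_memory_append(state: AgentState, label: str, content: str) -> str:
    state.core[label] = (state.core[label] + "\n" + content).strip()
    return "ok"


def archival_insert(state: AgentState, content: str) -> str:
    state.archival.append(content)
    return "ok"


def archival_search(state: AgentState, query: str) -> list[str]:
    # Naive substring match; a real archival tier would use embeddings.
    return [p for p in state.archival if query.lower() in p.lower()]


TOOLS = {
    "core_memory_append": core_memory_append,
    "archival_insert": archival_insert,
    "archival_search": archival_search,
}


def dispatch(state: AgentState, tool_call: dict):
    """Run one tool call emitted by the model; the harness never writes memory itself."""
    return TOOLS[tool_call["name"]](state, **tool_call["arguments"])


# Tool calls a model might emit during one turn (hard-coded for the sketch).
model_output = [
    {"name": "core_memory_append",
     "arguments": {"label": "human", "content": "Name: Ada"}},
    {"name": "archival_insert",
     "arguments": {"content": "Ada prefers async standups."}},
    {"name": "archival_search", "arguments": {"query": "standups"}},
]

state = AgentState()
results = [dispatch(state, call) for call in model_output]
```

Every write in this loop originates from a tool call in `model_output`; the application code only routes. That is the property that makes the pattern a good or bad fit, depending on who should own the write policy.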
Integration pattern 2: mem0 (TypeScript)
The mem0 integration pattern is the inverse: the application owns the message routing, mem0 owns the memory write/read on every turn. You call memory.add(messages, { userId }) after each user turn and memory.search(query, { userId }) before each model call. The framework extracts facts during add and serves the relevant subset during search. Install: npm install mem0ai (open-source mode) or use Mem0 Cloud.
The opinion this framework ships: the unit of long-term memory is the distilled fact, not the raw turn. If you want the raw turns preserved verbatim, mem0 fights you — that’s not what the framework optimizes for. The escape hatch is memory.add with infer: false, which skips extraction and stores the raw text, but that path is not what the LoCoMo and LongMemEval numbers in the marketing are measured on.
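The add-after-turn, search-before-call loop can be sketched without the SDK. The example below is in Python rather than TypeScript and uses a hypothetical `StubMemory` class, not mem0 itself; its "extraction" is a toy rule (keep first-person sentences) standing in for the LLM call a distill-at-write pipeline would make, and its relevance score is simple word overlap.

```python
from collections import defaultdict


class StubMemory:
    """Hypothetical stand-in exposing an add/search surface; not the mem0 SDK."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = defaultdict(list)

    def add(self, text: str, user_id: str, infer: bool = True) -> None:
        if infer:
            # Toy extraction: keep first-person sentences only.
            sentences = [s.strip() for s in text.split(".") if s.strip()]
            self._facts[user_id].extend(
                s for s in sentences if s.startswith("I ")
            )
        else:
            self._facts[user_id].append(text)  # raw text, no distillation

    def search(self, query: str, user_id: str, limit: int = 3) -> list[str]:
        # Toy relevance: number of shared lowercase words.
        words = set(query.lower().split())
        scored = [
            (len(words & set(fact.lower().split())), fact)
            for fact in self._facts[user_id]
        ]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        return [fact for score, fact in ranked if score > 0][:limit]


def handle_turn(memory: StubMemory, user_id: str, user_message: str) -> str:
    """Search before the model call, add after the turn."""
    context = memory.search(user_message, user_id=user_id)
    prompt = "Known facts:\n" + "\n".join(context) + "\n\nUser: " + user_message
    memory.add(user_message, user_id=user_id)
    return prompt


memory = StubMemory()
handle_turn(memory, "u1", "I am vegetarian. Nice weather today.")
prompt = handle_turn(memory, "u1", "What should I cook if I am hungry?")
```

The second prompt carries the distilled fact but not the small talk, which is the trade the framework makes on your behalf; passing `infer=False` in this stub mirrors the raw-storage escape hatch described above.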
Integration pattern 3: Graphiti (Python sketch)
Graphiti’s contract is: "give me episodes with timestamps, and I’ll give you a temporally correct knowledge graph and a fused vector+BM25+graph retriever." Install: pip install graphiti-core and run Neo4j (docker run -d -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5).
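The "fused" part of that contract is worth seeing in isolation. The sketch below implements standard reciprocal rank fusion over three ranked lists; it is a generic illustration, not Graphiti's internal code, and the episode IDs and the constant k=60 (the conventional default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from three retrievers over the same episodes.
vector_hits = ["e3", "e1", "e2"]
bm25_hits = ["e1", "e4", "e3"]
graph_hits = ["e1", "e3"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits, graph_hits])
# fused == ["e1", "e3", "e4", "e2"]
```

Because the fusion works on ranks rather than raw scores, the three retrievers never need comparable scoring scales, and no model call is involved at read time.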
The bi-temporal property is load-bearing: the “as of March 20” query returns Priya (the manager as-of that date), not Devansh (the current one). Vector-only stores cannot answer that question correctly regardless of how their retrieval is scored. If your workload doesn’t include point-in-time queries, this property is wasted complexity; if it does, no other framework ships it as a first-class concept.
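A point-in-time lookup over bi-temporally stamped facts can be sketched in plain Python. This is not Graphiti's API: the `Fact` shape, the `as_of` helper, the "team-x" subject, and every date below are hypothetical, chosen only so that a March 20 query falls inside Priya's tenure. The sketch stores transaction time but filters on valid time only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: date           # when the fact became true in the world
    valid_to: Optional[date]   # None means still true
    recorded_at: date          # when the system learned it (transaction time)


def as_of(facts: list[Fact], subject: str, predicate: str,
          when: date) -> Optional[str]:
    """Return the value that was valid at `when`, or None if nothing was."""
    for f in facts:
        if f.subject != subject or f.predicate != predicate:
            continue
        if f.valid_from <= when and (f.valid_to is None or when < f.valid_to):
            return f.obj
    return None


# Hypothetical history: the manager changes on April 1.
facts = [
    Fact("team-x", "managed_by", "Priya",
         date(2026, 1, 5), date(2026, 4, 1), date(2026, 1, 6)),
    Fact("team-x", "managed_by", "Devansh",
         date(2026, 4, 1), None, date(2026, 4, 2)),
]

march_manager = as_of(facts, "team-x", "managed_by", date(2026, 3, 20))
current_manager = as_of(facts, "team-x", "managed_by", date(2026, 5, 1))
```

The superseded fact is closed with a `valid_to` date rather than deleted, which is what lets the March query still resolve to Priya after Devansh takes over.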
Trade-offs, failure modes, gotchas
Letta — the over-pinned-core failure mode. An agent with no demotion discipline grows its core blocks until the prompt becomes attention-thin (the lost-in-the-middle effect) or hits the context limit. Per-block limit is a structural mitigation; the looser failure mode is the agent that pins everything because it doesn’t know what to demote. The hierarchical memory article covers this.
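A per-block limit with a demotion rule might look like the following. This is a sketch under stated assumptions, not Letta's implementation: `CoreBlock` and `append_with_demotion` are hypothetical, the limit is counted in characters, and oldest-first demotion is an assumed policy (in Letta the agent decides what to demote).

```python
from dataclasses import dataclass, field


@dataclass
class CoreBlock:
    """Hypothetical core-memory block with a hard character limit."""
    limit: int
    entries: list[str] = field(default_factory=list)

    def size(self) -> int:
        return sum(len(e) for e in self.entries)


def append_with_demotion(block: CoreBlock, archival: list[str],
                         entry: str) -> None:
    """Append to core; demote oldest entries to archival until the block fits."""
    if len(entry) > block.limit:
        archival.append(entry)  # too large to ever pin; store it cold
        return
    block.entries.append(entry)
    while block.size() > block.limit:
        archival.append(block.entries.pop(0))


block = CoreBlock(limit=20)
archival: list[str] = []
for item in ["aaaaaaaa", "bbbbbbbb", "cccccccc"]:  # 8 characters each
    append_with_demotion(block, archival, item)
```

The structural limit guarantees the prompt cannot grow without bound; what it cannot do is choose the right entry to demote, which is the looser failure mode described above.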
mem0 — the fact-extraction-bias failure mode. The write-time LLM call is what makes the read cheap; it is also where information the conversation only made implicit gets lost. Sarcasm, conditionals (“if it rains, I’ll skip the meeting”), and multi-turn negotiations all flatten badly through fact extraction. Fix: layer mem0 over a raw episode log — mem0 for facts, your own table for raw turns — and read from both depending on query type.
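The layering fix can be sketched as a store with two write targets and a router on the read side. Everything here is hypothetical: `DualMemory`, the cue words used for routing, and the keyword matching are toy stand-ins for a fact store, a raw-turn table, and a real query classifier.

```python
from dataclasses import dataclass, field

# Assumed cue words that signal the caller wants verbatim wording.
VERBATIM_CUES = ("exactly", "said", "quote", "wording")


@dataclass
class DualMemory:
    """Hypothetical layering: distilled facts plus a verbatim episode log."""
    facts: list[str] = field(default_factory=list)
    episodes: list[str] = field(default_factory=list)

    def write(self, raw_turn: str, extracted_facts: list[str]) -> None:
        self.episodes.append(raw_turn)       # always keep the raw turn
        self.facts.extend(extracted_facts)   # whatever the extractor produced

    def read(self, query: str) -> list[str]:
        lowered = query.lower()
        # Toy router: questions about wording go to the raw log.
        wants_verbatim = any(cue in lowered for cue in VERBATIM_CUES)
        source = self.episodes if wants_verbatim else self.facts
        terms = [t.strip("?.,!") for t in lowered.split()]
        terms = [t for t in terms if len(t) > 3]
        return [item for item in source
                if any(t in item.lower() for t in terms)]


mem = DualMemory()
mem.write("If it rains, I'll skip the meeting.", ["User may skip the meeting"])

verbatim = mem.read("What exactly did they say about the meeting?")
distilled = mem.read("Will the user attend the meeting")
```

The conditional survives in the episode log even though the extracted fact flattened it, so queries that depend on the exact wording still have something correct to read.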
Zep / Graphiti — entity-extraction cost. The write path is dominated by small-model calls for entity and relation extraction; at scale that’s one to three calls per episode. Mitigations are the standard expensive-write playbook: batch when latency tolerates, skip extraction for low-value episodes, budget explicitly. The write policies article is the triage-stage deep dive.
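The three mitigations (batch, skip, budget) compose into a small write-side queue. The sketch below is an assumption-laden illustration, not part of Zep or Graphiti: `ExtractionQueue`, the length-based low-value heuristic, and the specific thresholds are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionQueue:
    """Hypothetical write-side triage: skip, batch, and cap extraction calls."""
    batch_size: int
    call_budget: int                 # max extraction calls in this window
    min_chars: int = 20              # shorter episodes count as low-value
    pending: list[str] = field(default_factory=list)
    calls_made: int = 0
    skipped: int = 0

    def submit(self, episode: str) -> list[list[str]]:
        """Queue an episode; return any batches released for extraction."""
        if len(episode) < self.min_chars:
            self.skipped += 1
            return []
        self.pending.append(episode)
        return self.flush()

    def flush(self, force: bool = False) -> list[list[str]]:
        released = []
        while self.pending and (force or len(self.pending) >= self.batch_size):
            if self.calls_made >= self.call_budget:
                break  # over budget: leave episodes queued for a later window
            batch = self.pending[: self.batch_size]
            self.pending = self.pending[self.batch_size:]
            self.calls_made += 1
            released.append(batch)
        return released


queue = ExtractionQueue(batch_size=2, call_budget=1)
episodes = [
    "ok",
    "The customer moved to Berlin last week.",
    "Their contract renews in September.",
    "They asked about enterprise pricing.",
    "A new account manager was assigned.",
]
batches = [batch for ep in episodes for batch in queue.submit(ep)]
```

With a budget of one call, the first two substantive episodes are extracted together and the next two wait in `pending`, making the cost ceiling explicit instead of discovering it on the invoice.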
Benchmark numbers are non-comparable across protocols. Mem0’s 94.4% LongMemEval and Zep’s 71.2% LongMemEval used different judges, prompts, and ingest pipelines. Treat the numbers as within-framework deltas, not cross-framework rankings. The evaluation article works through the LLM-as-judge bias and protocol drift.
Lock-in cost of distillation. mem0’s distill-at-write locks you into its extraction logic — migration means replaying conversations through the new framework’s extractor (expensive) or losing accumulated facts (lossy). Letta’s recall tier and Graphiti’s episode log both preserve raw episodes alongside derived structure, giving a cleaner exit path.
Conflict resolution is uniformly weak. MemoryAgentBench’s multi-hop contradiction resolution stays under ~6% across all four. None ship a robust solution; the conflict-and-forgetting article covers the supersession-versus-deletion patterns you layer on top. Bi-temporal substrates (Zep/Graphiti) at least give you the data model.
Sleep-time-compute compatibility is uneven. Letta ships sleep-time agents as a first-class concept; mem0 and Zep support background consolidation but the pattern is more DIY. The sleep-time-compute article covers when this matters — and for high-throughput multi-tenant workloads where idle time is scarce, none of the frameworks ship a perfect answer.
Further reading
- State of AI Agent Memory 2026 (mem0.ai) — the most current cross-framework benchmark report, with explicit protocols. Pair with Letta’s controlled-benchmark response for the inevitable vendor-disagreement view.
- MemGPT: Towards LLMs as Operating Systems (Packer et al., 2023) — the founding paper for hierarchical memory and the OS-paging analogy. Letta is its production embodiment.
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory (Chhikara et al., 2025) — the canonical reference for distill-at-write and Mem0g graph-augmented retrieval.
- Zep: A Temporal Knowledge Graph Architecture for Agent Memory (Rasmussen et al., 2025) — the bi-temporal-graph-as-primary-substrate case, with the LongMemEval breakdown that motivates the design.
What to read next
- Memory Evaluation: Benchmarks and Custom Evals — the measurement layer that calibrates every framework comparison in this piece. Before adopting a framework on the strength of its published numbers, run the protocols there against your workload.
- Hierarchical Memory: Working / Episodic / Semantic Tiers — the architecture that the MemGPT/Letta side of the matrix instantiates. The OS-paging model, the core/recall/archival tier definitions, and the promotion/demotion policies are the substrate Letta productizes.
- Knowledge Graphs as Structured Memory — the architecture that the Zep/Graphiti side of the matrix instantiates. The bi-temporal model, the hybrid graph+vector retrieval, and the entity-extraction write path covered there are what Graphiti operationalizes.
- Memory Write Policies: What’s Worth Remembering — the layer where each framework’s design philosophy is most visible. mem0’s distill-at-write, Letta’s agent-driven append, Zep’s entity-extraction pipeline — all three are write-policy variants of the same four-stage pipeline that article covers.