Production Memory Frameworks: MemGPT/Letta, mem0, Zep, Graphiti

MemGPT/Letta, mem0, Zep, and Graphiti compared on architecture, write/read paths, benchmarks, and the build-versus-buy decision for production memory.

Jatin Bansal

A team has shipped twenty memory primitives and now has to pick what actually runs in production. Vector store, episodic log, bi-temporal graph, reflection job, conflict resolver, multi-tenant scoping, eval harness: all twenty are real pieces of code with real maintenance budgets. The question every team eventually asks is the unglamorous one: do we keep building these ourselves, or adopt a framework that has converged on most of them? The four frameworks that have absorbed most of the design space (Letta, the MemGPT productization; mem0; Zep; and Graphiti) each ship a different opinion about which primitives are load-bearing. Picking the right one is downstream of understanding what each is optimized for; the wrong choice locks you into a write pipeline you cannot easily migrate off. This piece is the capstone of the memory subtree: the comparison matrix, the build-versus-buy decision, the integration patterns, and the failure modes each framework still has in mid-2026.

Opening bridge

Yesterday’s piece on memory evaluation closed the measurement axis: LoCoMo, LongMemEval, BEAM, MemoryAgentBench, the per-category breakdown, the precision/recall harness, the contradiction-resolution category where every framework still scores under 6%. The eighteen articles before it walked the mechanics — write policies, retrieval policies, hierarchical memory, knowledge graphs, conflict and forgetting, sleep-time compute. Every one of those articles flagged the same forward reference: the production frameworks article will work the comparison matrix. Today’s piece is that matrix. The memory subtree closes here; the next subtree (Evaluation) opens with eval-driven development, which generalizes the eval discipline this article uses to compare frameworks into the workflow every LLM application needs.

Definition

A production memory framework is a runtime that bundles the write pipeline, storage substrate(s), read pipeline, multi-tenancy primitives, and maintenance passes into a single SDK. A substrate (pgvector, Qdrant) is unopinionated; a framework picks an episode shape, a write gate, a retrieval blend, a tier policy, and a tenant model, then exposes them as a coherent add/search/update/delete API. Adopting a framework is buying its opinions.
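To make "a coherent add/search/update/delete API" concrete, here is a minimal in-memory sketch of that surface with a tenant key on every call. The class name `ToyMemory`, its method signatures, and the substring-match search are illustrative assumptions, not the API of any of the four frameworks.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyMemory:
    """Hypothetical stand-in for a framework's public surface (not a real SDK)."""
    _rows: dict = field(default_factory=dict)   # (tenant_id, memory_id) -> text
    _ids: count = field(default_factory=count)  # monotonically increasing IDs

    def add(self, tenant_id: str, text: str) -> int:
        memory_id = next(self._ids)
        self._rows[(tenant_id, memory_id)] = text
        return memory_id

    def search(self, tenant_id: str, query: str) -> list[str]:
        # Real frameworks rank by embeddings or graph traversal; substring
        # matching keeps this sketch dependency-free.
        q = query.lower()
        return [text for (tid, _), text in self._rows.items()
                if tid == tenant_id and q in text.lower()]

    def update(self, tenant_id: str, memory_id: int, text: str) -> None:
        if (tenant_id, memory_id) not in self._rows:
            raise KeyError("unknown memory for this tenant")
        self._rows[(tenant_id, memory_id)] = text

    def delete(self, tenant_id: str, memory_id: int) -> None:
        self._rows.pop((tenant_id, memory_id), None)


mem = ToyMemory()
mid = mem.add("acme", "Priya's order #4321 has not shipped")
mem.add("globex", "Order #4321 belongs to someone else")
hits = mem.search("acme", "order #4321")  # only the acme row is visible
```

The point of the sketch is the shape, not the storage: every method takes the tenant scope, so isolation is enforced by the API rather than by caller discipline.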

Four define the field in 2026. MemGPT/Letta is the productized version of the original MemGPT paper — three-tier hierarchical memory (core/recall/archival) with the agent self-managing tier promotion via tool calls. mem0 is the distill-at-write vector layer with an optional graph extension (Mem0g); the LLM-gated fact-extraction pipeline runs on every add. Zep is the graph-first hybrid: a bi-temporal knowledge graph wraps vector and BM25 indexes, all retrieval fused, no LLM in the read path. Graphiti is Zep’s open-source temporal-graph engine, usable standalone when you want the bi-temporal substrate without the cloud product on top.
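The "all retrieval fused" step in the Zep description can be illustrated with reciprocal rank fusion (RRF), the method the comparison matrix names for the Zep and Graphiti read paths. This is a generic sketch of RRF, not Zep's implementation; the fact IDs and the three input rankings are made up, and `k=60` is simply a commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ranked results from three retrievers over the same fact IDs.
vector_hits = ["f2", "f1", "f3"]
bm25_hits = ["f1", "f2", "f4"]
graph_hits = ["f1", "f3"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits, graph_hits])
```

Because RRF works on ranks rather than raw scores, it needs no score normalization across retrievers and no model call, which is consistent with the "no LLM in the read path" property.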

Intuition

The mental model that compresses the four-way comparison: each framework optimizes a different point on the write-cost-versus-read-cost curve, and the right choice is determined by which side of that curve your workload’s hot path sits on.

Letta pays the least at write time and lets the agent decide what to promote. The framework gives you the tier topology and the tool surface (core_memory_append, archival_search); the policy lives in the prompt.
mem0 pays heavily at write time (one or two LLM calls per turn to extract facts) so reads stay vector-only and cheap. Mem0g adds a parallel graph write for relational queries; the read still stays LLM-free.
Zep pays the most at write time — entity extraction, relation extraction, deduplication, bi-temporal stamping, contradiction detection — and runs a pure-traversal read. All the model work happens during ingest.
Graphiti is the substrate Zep is built on, exposed standalone. Same bi-temporal graph, same write cost, no cloud product around it.

The right framework is the one whose write/read asymmetry matches your workload’s traffic pattern. A high-write, low-read background job benefits least from mem0/Zep’s write-heavy designs; a chatbot with 10 reads per write benefits most.
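The asymmetry argument reduces to simple arithmetic: total cost per write cycle is the write cost plus the read cost multiplied by the number of reads that follow each write. The sketch below uses invented unit costs (arbitrary units, not measured from any framework) to show where the preference flips.

```python
def cost_per_write_cycle(write_cost: float, read_cost: float, reads_per_write: float) -> float:
    """Total cost of one write plus the reads that follow it (arbitrary units)."""
    return write_cost + reads_per_write * read_cost


# Hypothetical unit costs: a distill-at-write design versus a cheap-write design
# that pays for interpretation at read time instead.
designs = {
    "distill_at_write": {"write_cost": 10.0, "read_cost": 1.0},
    "cheap_write": {"write_cost": 1.0, "read_cost": 4.0},
}

for reads_per_write in (0.1, 3, 10):
    totals = {
        name: cost_per_write_cycle(reads_per_write=reads_per_write, **costs)
        for name, costs in designs.items()
    }
    cheapest = min(totals, key=totals.get)
    print(f"reads/write={reads_per_write}: {totals} -> cheapest: {cheapest}")
```

With these made-up numbers the break-even sits at three reads per write; below it the cheap-write design wins, above it the distill-at-write design wins. Substituting your own measured costs and traffic ratio is the useful exercise.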

The distributed-systems parallel

The four frameworks line up cleanly against four database-design archetypes. Letta is an in-memory store with paging to disk — the agent’s context is the hot tier, recall and archival are disk, tool calls are page faults; see the hierarchical memory article for the deep dive. mem0 is a denormalized read-optimized store with an LLM-driven ETL on the write path — facts as materialized view; the OLAP read-optimized cube pattern. Zep is a graph database with vector and BM25 secondary indexes — graph as source of truth, auxiliary indexes as fuzzy fallback; the bi-temporal columns (valid_time and transaction_time) are the audit-heavy transactional database pattern lifted directly. Graphiti is the embedded engine version of the same graph database, the way SQLite is to Postgres — same data model, hosted yourself, fewer batteries included.

The disanalogy: database systems have decades of vendor stability; these frameworks ship breaking changes monthly. Treat the API contracts in this article as approximately right for mid-2026; read the current docs before integration.

The comparison matrix

| Dimension | Letta (MemGPT) | mem0 | Zep | Graphiti |
| --- | --- | --- | --- | --- |
| Primary substrate | Hierarchical (core/recall/archival) | Vector + optional graph | Graph + vector + BM25 | Bi-temporal graph |
| Write path cost | Low (DB write + tool call) | High (LLM fact extraction per turn) | Very high (entity + relation extraction + bi-temporal stamping) | Very high (same as Zep) |
| Read path cost | Low (tool calls, no LLM in core) | Low (vector search + optional graph traversal) | Low (pure traversal + RRF, no LLM) | Low (pure traversal + RRF) |
| Bi-temporal | No (single transaction clock) | No | Yes (valid + transaction time) | Yes |
| Self-managed by agent | Yes (agent calls tier-promotion tools) | No (harness-driven) | No (harness-driven) | No |
| Multi-tenancy | Per-agent state (built-in) | `user_id` required parameter | `user_id` / `session_id` (built-in) | `group_id` namespace |
| Hosted option | Letta Cloud + self-host | mem0 Cloud + open-source | Zep Cloud + open-source community edition | Self-host only |
| 2026 benchmark anchor | ~83% LongMemEval (community report) | 94.4% LongMemEval (token-efficient algorithm, mem0.ai/research) | 71.2% LongMemEval (Zep paper) | Same substrate as Zep |
| Best-fit workload | Long-running stateful agents where the agent itself manages context | Chatbots and assistants with high user-fact density | CRM, compliance, healthcare (relational + temporal queries dominate) | Greenfield graph-first builds, self-hosted |
| Worst-fit workload | Stateless or short-lived agents | Workloads where raw episodes matter more than distilled facts | Workloads with no relational structure | Same as Zep, plus teams who want a managed product |

The benchmark numbers in the “2026 benchmark anchor” row are the most volatile entry in the table. Mem0’s LoCoMo score went from 66.9% in 2025 to 92.5% in 2026: partly genuine algorithm improvement, partly protocol stabilization, partly hill-climbing. Read the protocols (judge model, ingest pipeline, top-K) before comparing across columns. The memory evaluation article is the deep dive on why direct cross-framework comparison is harder than it looks.

Build versus buy

Reach for a framework when (a) your team has fewer than two engineers who can own a memory subsystem long-term, (b) your workload fits within ±20% of one of the four frameworks’ opinions, and (c) you don’t have an existing storage layer the framework would fight. Hand-roll when (a) you have those engineers, (b) your defining write or read pattern isn’t covered (per-document write policies for a legal-research agent, custom segmentation for a code-review agent, or a graph schema that doesn’t fit Graphiti’s entity-relation model), or (c) you already operate a vector store and a graph store and the framework’s opinions about both fight your data model.

The most common mistake: adopting a framework, then writing so much code around it to make it fit that you would have been better off rolling your own. mem0 and Letta both have escape hatches (custom prompts, overridable extraction, custom tools), but every escape hatch is a place the next breaking change will land. If you find yourself overriding more than two defaults, the framework is wrong for your workload.

The escape from the binary: roll your own on a store primitive and adopt a framework only for the layer where its opinions are load-bearing. LangGraph stores give you a tuple-namespaced KV/vector store; you build the write policy, retrieval blend, and tier topology yourself, and adopt Graphiti only for the graph layer if your workload needs bi-temporal queries. Most teams converge on this after a year.
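As a rough picture of what "a tuple-namespaced KV store" gives you to build on, here is a dependency-free sketch. `NamespacedStore` and its methods are hypothetical and deliberately do not mirror LangGraph's actual store API; the namespace layout (tenant, user, kind) is also an assumption.

```python
from collections import defaultdict
from typing import Optional


class NamespacedStore:
    """Hypothetical tuple-namespaced KV store (illustrative, not LangGraph's API)."""

    def __init__(self) -> None:
        self._data: dict = defaultdict(dict)  # namespace tuple -> {key: value}

    def put(self, namespace: tuple, key: str, value: dict) -> None:
        self._data[namespace][key] = value

    def get(self, namespace: tuple, key: str) -> Optional[dict]:
        # .get() on the outer dict avoids creating empty namespaces on reads.
        return self._data.get(namespace, {}).get(key)

    def list_prefix(self, prefix: tuple) -> list:
        """Return (namespace, key, value) for everything under a namespace prefix."""
        return [(ns, key, value)
                for ns, items in self._data.items()
                if ns[:len(prefix)] == prefix
                for key, value in items.items()]


store = NamespacedStore()
# The application, not a framework, decides what gets written and where.
store.put(("acme", "user_priya", "facts"), "manager", {"value": "Devansh"})
store.put(("acme", "user_priya", "episodes"), "2026-04-15",
          {"text": "Devansh is my new manager starting April 15."})
store.put(("globex", "user_x", "facts"), "manager", {"value": "Someone else"})
acme_rows = store.list_prefix(("acme",))  # tenant-scoped read
```

Everything the frameworks have opinions about (write gating, retrieval blend, tiering) sits above this primitive, which is why it is a reasonable base when only one layer needs a framework.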

Integration pattern 1: Letta (Python)

The Letta integration pattern: the framework owns the agent state, and the application owns the message routing. You call the SDK with messages; Letta manages the memory blocks, the recall/archival tiers, and the persistence behind a client.agents.create / client.agents.messages.create surface. Install: pip install letta-client, then run a local server (docker run -d -p 8283:8283 letta/letta:latest) or use Letta Cloud.

python

# pip install letta-client
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# Create an agent with hierarchical memory: core memory blocks are always in context,
# recall and archival tiers live outside the prompt and the agent pages them in via tools.
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {"label": "persona", "value": "You are a customer-support agent for Acme.",
         "limit": 1000},
        {"label": "human", "value": "Unknown user.", "limit": 2000},
    ],
    tools=["web_search"],  # core_memory_append, recall_search, archival_search are auto-injected
)

# Each call returns the full step trace; Letta has already updated memory blocks,
# inserted into recall/archival, and persisted state. No explicit write call needed.
reply = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "I'm Priya, my order #4321 hasn't shipped."}],
)
assistant_msg = next(m for m in reversed(reply.messages) if m.message_type == "assistant_message")
print(assistant_msg.content)

# Inspect the updated memory blocks — the agent should have written "Priya" and "#4321"
# into the human block via core_memory_append on its own.
state = client.agents.retrieve(agent_id=agent.id)
for block in state.memory.blocks:
    print(block.label, "→", block.value)

The key property: there is no explicit memory.add() call. The agent decides what to remember, and the framework records what it decided. This is Letta’s central opinion — agent-driven memory management — and it is either exactly right (long-running stateful agents that learn from their interactions) or exactly wrong (workflows where the harness, not the agent, owns the write policy).

Integration pattern 2: mem0 (TypeScript)

The mem0 integration pattern is the inverse: the application owns the message routing, mem0 owns the memory write/read on every turn. You call memory.add(messages, { userId }) after each user turn and memory.search(query, { userId }) before each model call. The framework extracts facts during add and serves the relevant subset during search. Install: npm install mem0ai (open-source mode) or use Mem0 Cloud.

typescript

// npm install mem0ai
import { Memory } from "mem0ai/oss";

// Vector store + optional graph (Mem0g). Set OPENAI_API_KEY in the environment.
const memory = new Memory({
  vectorStore: { provider: "qdrant", config: { collectionName: "agent_memory", host: "localhost", port: 6333 } },
  // Uncomment to enable Mem0g — adds graph store for relational queries.
  // graphStore: { provider: "neo4j", config: { url: "bolt://localhost:7687", username: "neo4j", password: "password" } },
});

// Write: pass the conversation; mem0 runs LLM fact extraction internally
// and stores the distilled claims, not the raw turns.
await memory.add(
  [
    { role: "user", content: "I'm Priya. Order #4321 hasn't shipped yet." },
    { role: "assistant", content: "Hi Priya, let me check on order #4321." },
  ],
  { userId: "user_priya_123" },
);

// Read: before the next model call, pull the relevant memories for the user.
// search() returns the distilled facts mem0 chose to store.
const results = await memory.search("What's the user's name and pending order?", {
  userId: "user_priya_123",
  limit: 5,
});
for (const r of results.results) {
  console.log(`[score=${r.score.toFixed(2)}]`, r.memory);
}

// Inject results into the next system prompt. The harness, not the framework,
// owns this rendering step — mem0 returns the facts; you decide how to use them.
const memoryContext = results.results.map((r) => `- ${r.memory}`).join("\n");
const systemPrompt = `## Known facts about the user:\n${memoryContext}\n\nReply to the user.`;
// ... pass systemPrompt to your LLM SDK of choice.

The opinion this framework ships: the unit of long-term memory is the distilled fact, not the raw turn. If you want the raw turns preserved verbatim, mem0 fights you — that’s not what the framework optimizes for. The escape hatch is memory.add with infer: false, which skips extraction and stores the raw text, but that path is not what the LoCoMo and LongMemEval numbers in the marketing are measured on.

Integration pattern 3: Graphiti (Python sketch)

Graphiti’s contract: give it episodes with timestamps, and it returns a temporally correct knowledge graph plus a fused vector+BM25+graph retriever. Install: pip install graphiti-core, then run Neo4j (docker run -d -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5).

python

# pip install graphiti-core neo4j
import asyncio
from datetime import datetime
from graphiti_core import Graphiti

graphiti = Graphiti(
    uri="bolt://localhost:7687", user="neo4j", password="password",
)

async def main():
    await graphiti.build_indices_and_constraints()

    # add_episode triggers entity extraction, relation extraction, dedup,
    # bi-temporal stamping, and contradiction detection in one call.
    await graphiti.add_episode(
        name="march_check_in",
        episode_body="Priya is my manager. She approved the Q1 forecast on March 15.",
        source_description="user message",
        reference_time=datetime(2026, 3, 15),
        group_id="tenant_acme",  # multi-tenant scope
    )
    await graphiti.add_episode(
        name="april_update",
        episode_body="Devansh is my new manager starting April 15. Priya moved to a different team.",
        source_description="user message",
        reference_time=datetime(2026, 4, 15),
        group_id="tenant_acme",
    )

    # The read path: search() is hybrid (graph traversal + vector + BM25 fused with RRF),
    # no LLM call in the loop. Filter by group_id for tenant isolation.
    results = await graphiti.search(query="Who manages this user as of March 20?", group_ids=["tenant_acme"])
    for edge in results:
        print(edge.fact, "valid:", edge.valid_at, "->", edge.invalid_at)

asyncio.run(main())

The bi-temporal property is load-bearing: every returned edge carries its validity interval (valid_at, invalid_at), so the “as of March 20” question resolves to Priya (the manager as of that date), not Devansh (the current one). Vector-only stores cannot answer that question correctly regardless of how their retrieval is scored. If your workload doesn’t include point-in-time queries, this property is wasted complexity; if it does, neither Letta nor mem0 ships it as a first-class concept.
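How a point-in-time answer falls out of validity intervals can be shown without a graph database. The sketch below filters edges by a half-open interval; the `Edge` dataclass is a hypothetical stand-in for a returned edge (only the `fact`, `valid_at`, and `invalid_at` field names come from the example above), and the interval dates are assumed for illustration. Whether a given search call resolves dates mentioned in the query text on its own is worth checking against current docs; filtering explicitly is the conservative option.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Edge:
    """Hypothetical stand-in for a returned graph edge: a fact plus its validity."""
    fact: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means still valid


def as_of(edges: list[Edge], when: datetime) -> list[Edge]:
    """Keep edges whose interval [valid_at, invalid_at) contains `when`."""
    return [e for e in edges
            if e.valid_at <= when and (e.invalid_at is None or when < e.invalid_at)]


# Assumed intervals matching the two episodes in the example above.
edges = [
    Edge("Priya manages the user", datetime(2026, 3, 15), datetime(2026, 4, 15)),
    Edge("Devansh manages the user", datetime(2026, 4, 15)),
]
march = [e.fact for e in as_of(edges, datetime(2026, 3, 20))]
may = [e.fact for e in as_of(edges, datetime(2026, 5, 1))]
```

The half-open interval means a fact stops being valid at the exact moment its successor starts, so no date ever matches both managers.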

Trade-offs, failure modes, gotchas

Letta — the over-pinned-core failure mode. An agent with no demotion discipline grows its core blocks until the prompt becomes attention-thin (the lost-in-the-middle effect) or hits the context limit. Per-block limit is a structural mitigation; the looser failure mode is the agent that pins everything because it doesn’t know what to demote. The hierarchical memory article covers this.
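The per-block limit plus a demotion rule can be sketched in a few lines. `CoreBlock` and `append_with_demotion` are hypothetical names, not Letta classes or tools, and oldest-first demotion is an assumption chosen only because it is the simplest policy to show.

```python
from dataclasses import dataclass, field


@dataclass
class CoreBlock:
    """Hypothetical core-memory block with a character limit (not Letta's class)."""
    label: str
    limit: int
    entries: list[str] = field(default_factory=list)

    def size(self) -> int:
        return sum(len(entry) for entry in self.entries)


def append_with_demotion(block: CoreBlock, archival: list[str], entry: str) -> None:
    """Append to core, then demote oldest entries to archival until under the limit."""
    block.entries.append(entry)
    while block.size() > block.limit and len(block.entries) > 1:
        archival.append(block.entries.pop(0))  # oldest-first demotion


human = CoreBlock(label="human", limit=45)
archival: list[str] = []
for note in ["Name: Priya", "Order #4321 pending", "Prefers email contact"]:
    append_with_demotion(human, archival, note)
```

Note what oldest-first does here: it demotes the user's name, arguably the entry most worth pinning. That is the failure mode in miniature; a limit bounds the block's size, but only an importance-aware policy decides well what leaves.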

mem0 — the fact-extraction-bias failure mode. The write-time LLM call is what makes the read cheap; it is also where information the conversation only made implicit gets lost. Sarcasm, conditionals (“if it rains, I’ll skip the meeting”), and multi-turn negotiations all flatten badly through fact extraction. Fix: layer mem0 over a raw episode log — mem0 for facts, your own table for raw turns — and read from both depending on query type.
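The "read from both depending on query type" fix implies a small router in front of the two stores. The sketch below uses a crude keyword heuristic; `fact_search` and `raw_search` are hypothetical callables standing in for the framework's search call and your own episode-table query, and the cue list is an assumption a real system would likely replace with a classifier.

```python
from typing import Callable

# Both callables take (query, user_id) and return matching text snippets.
SearchFn = Callable[[str, str], list[str]]

VERBATIM_CUES = ("exactly", "verbatim", "quote", "what did i say", "word for word")


def needs_raw_turns(query: str) -> bool:
    """Crude heuristic: does the query ask for wording rather than a fact?"""
    q = query.lower()
    return any(cue in q for cue in VERBATIM_CUES)


def recall(query: str, user_id: str, fact_search: SearchFn, raw_search: SearchFn) -> list[str]:
    """Route to the raw episode log for verbatim questions, else to distilled facts."""
    source = raw_search if needs_raw_turns(query) else fact_search
    return source(query, user_id)


# Stub backends for illustration only.
def stub_facts(query: str, user_id: str) -> list[str]:
    return ["User may skip the meeting"]


def stub_raw(query: str, user_id: str) -> list[str]:
    return ["if it rains, I'll skip the meeting"]


verbatim = recall("What did I say about the meeting, exactly?", "u1", stub_facts, stub_raw)
summary = recall("Is the user attending the meeting?", "u1", stub_facts, stub_raw)
```

The stub outputs show why the split matters: the distilled fact has dropped the condition that the raw turn preserves.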

Zep / Graphiti — entity-extraction cost. The write path is dominated by small-model calls for entity and relation extraction; at scale that’s one to three calls per episode. Mitigations are the standard expensive-write playbook: batch when latency tolerates, skip extraction for low-value episodes, budget explicitly. The write policies article is the triage-stage deep dive.
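The three mitigations (batch, skip, budget) can be combined into a small triage step that runs before the expensive write. Everything here is an assumed sketch: the low-value phrase list, the `min_chars` threshold, and the `ExtractionBudget` counter are placeholders to tune, not part of Zep or Graphiti.

```python
from dataclasses import dataclass


@dataclass
class ExtractionBudget:
    """Hypothetical cap on extraction calls for some accounting window."""
    remaining_calls: int


LOW_VALUE = {"ok", "thanks", "thank you", "got it", "sure", "yes", "no"}


def should_extract(episode: str, budget: ExtractionBudget, min_chars: int = 20) -> bool:
    """Triage before the expensive write: skip chit-chat and respect the budget."""
    text = episode.strip().lower().rstrip(".!")
    if text in LOW_VALUE or len(text) < min_chars:
        return False
    if budget.remaining_calls <= 0:
        return False
    budget.remaining_calls -= 1  # only spent when extraction will actually run
    return True


def batch(episodes: list[str], size: int) -> list[list[str]]:
    """Group episodes so one extraction pass can cover several of them."""
    return [episodes[i:i + size] for i in range(0, len(episodes), size)]


budget = ExtractionBudget(remaining_calls=2)
episodes = [
    "Thanks!",
    "Devansh is my new manager starting April 15.",
    "ok",
    "Priya moved to a different team.",
]
kept = [e for e in episodes if should_extract(e, budget)]
batches = batch(kept, size=2)
```

Skipped episodes should still land in a raw log if you keep one, so that a later, cheaper, or better extractor can revisit them.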

Benchmark numbers are non-comparable across protocols. Mem0’s 94.4% LongMemEval and Zep’s 71.2% LongMemEval used different judges, prompts, and ingest pipelines. Treat the numbers as within-framework deltas, not cross-framework rankings. The evaluation article works through the protocol-drift specifics; the LLM-as-judge article is the deep dive on the judge biases (position, verbosity, self-preference) that drive a chunk of the gap.
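One way to enforce "within-framework deltas, not cross-framework rankings" in an eval harness is to store the protocol alongside every score and refuse to compare when protocols differ. The field names, framework labels, and scores below are hypothetical placeholders, not the published protocols or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalProtocol:
    benchmark: str
    judge_model: str
    ingest_pipeline: str
    top_k: int


@dataclass(frozen=True)
class Score:
    framework: str
    accuracy: float
    protocol: EvalProtocol


def comparable(a: Score, b: Score) -> bool:
    """Two scores are comparable only if they were produced under one protocol."""
    return a.protocol == b.protocol


# Placeholder protocols and scores for illustration only.
shared = EvalProtocol("LongMemEval", "judge-a", "pipeline-1", top_k=10)
other = EvalProtocol("LongMemEval", "judge-b", "pipeline-2", top_k=20)

ours_a = Score("framework_a", 0.90, shared)
ours_b = Score("framework_b", 0.70, shared)
vendor_b = Score("framework_b", 0.85, other)

same_protocol = comparable(ours_a, ours_b)
mixed_protocol = comparable(ours_a, vendor_b)
```

Frozen dataclasses compare by value, so any drift in judge, ingest pipeline, or top-K makes the comparison fail loudly instead of producing a misleading ranking.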

Lock-in cost of distillation. mem0’s distill-at-write locks you into its extraction logic — migration means replaying conversations through the new framework’s extractor (expensive) or losing accumulated facts (lossy). Letta’s recall tier and Graphiti’s episode log both preserve raw episodes alongside derived structure, giving a cleaner exit path.

Conflict resolution is uniformly weak. MemoryAgentBench’s multi-hop contradiction resolution stays under ~6% across all four. None ship a robust solution; the conflict-and-forgetting article covers the supersession-versus-deletion patterns you layer on top. Bi-temporal substrates (Zep/Graphiti) at least give you the data model.
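The supersession pattern, as opposed to deletion, can be sketched as a write rule: when a new fact contradicts an open one, close the old fact's validity interval rather than removing it. The `Fact` shape and the assumption that each (subject, predicate) pair holds one value at a time are illustrative simplifications; real contradiction detection is the hard, unsolved part.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means currently believed


def supersede(facts: list[Fact], new: Fact) -> None:
    """Close any open, conflicting fact for the same (subject, predicate), then append."""
    for old in facts:
        same_slot = (old.subject, old.predicate) == (new.subject, new.predicate)
        if same_slot and old.invalid_at is None and old.value != new.value:
            old.invalid_at = new.valid_at  # keep history instead of deleting
    facts.append(new)


facts: list[Fact] = []
supersede(facts, Fact("user", "manager", "Priya", datetime(2026, 3, 15)))
supersede(facts, Fact("user", "manager", "Devansh", datetime(2026, 4, 15)))
current = [f.value for f in facts if f.invalid_at is None]
```

Deletion would leave only Devansh; supersession leaves both, with Priya's interval closed on April 15, which is the data model the bi-temporal substrates provide out of the box.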

Sleep-time-compute compatibility is uneven. Letta ships sleep-time agents as a first-class concept; mem0 and Zep support background consolidation but the pattern is more DIY. The sleep-time-compute article covers when this matters — and for high-throughput multi-tenant workloads where idle time is scarce, none of the frameworks ship a perfect answer.

What to read next

Memory Evaluation: Benchmarks and Custom Evals — the measurement layer that calibrates every framework comparison in this piece. Before adopting a framework on the strength of its published numbers, run the protocols there against your workload.
Hierarchical Memory: Working / Episodic / Semantic Tiers — the architecture that the MemGPT/Letta side of the matrix instantiates. The OS-paging model, the core/recall/archival tier definitions, and the promotion/demotion policies are the substrate Letta productizes.
Knowledge Graphs as Structured Memory — the architecture that the Zep/Graphiti side of the matrix instantiates. The bi-temporal model, the hybrid graph+vector retrieval, and the entity-extraction write path covered there are what Graphiti operationalizes.
Eval-Driven Development for LLM Systems — the forward step. Once you’ve chosen a framework on a benchmark number, the eval suite is what catches the regression when the next breaking change ships. The error-analysis-first workflow and the test-pyramid layering generalize the discipline this article applies to framework comparison.