Memory Evaluation: Benchmarks and Custom Evals

Memory evaluation for agents: LoCoMo and LongMemEval, multi-hop recall, contradiction handling, and how to design a custom eval that catches drift.

Jatin Bansal

A team has spent six months building an agent memory layer. Episodes write cleanly; the retrieval policy blends recency, importance, and similarity; the knowledge graph tracks bi-temporal facts; the reflection loop turns episodes into beliefs nightly. The PM asks the question every PM asks: “is it any good?” The team has zero numbers. Internal dashboards count writes, reads, p95 latency, vector-store size. None of those answer the question. A demo lands well — the agent remembers the user’s dog’s name from three sessions ago — but a demo is one trace, not a measurement. A week later, an A/B ships a write-policy change. Recall on the demo trace still works; nobody notices that contradiction resolution silently regressed from “agent updates the stored belief” to “agent stores both contradictory beliefs and surfaces the older one.” This is where every memory framework either has a benchmark story or is shipping blind.

Opening bridge

Yesterday’s piece on memory privacy and multi-tenancy closed the governance axis. Every layer of the stack — write policies, episode segmentation, reflection, retrieval rerank, knowledge graph — has been built. None of it has been measured. The long-term memory piece flagged LongMemEval and LoCoMo and deferred the deep dive; the knowledge-graph piece cited the Zep paper’s 18% lift without working out what the categories probe; the retrieval-policies piece named the same benchmarks and stopped. Today’s article is that deep dive: the public benchmarks, the design constraints they impose, the categories where every framework still fails, and the custom-eval discipline that catches the workload-specific failure modes public benchmarks miss.

Definition

A memory eval is a fixed corpus of multi-session interactions plus a question/answer set with ground truth, run against a memory system that ingested the interactions first, scored on the system’s recall/correctness across categories that probe different memory abilities. Four properties separate it from a standard RAG eval. First, multi-session — the corpus spans many conversations with temporal gaps; a single 50-turn transcript measures long-context behavior, not memory. Second, ingest-then-query separation — the memory system writes during ingestion with no access to the questions; the questions arrive later and probe what the system chose to retain. Third, category-aware scoring — single-hop, multi-hop, temporal, knowledge-update, abstention each scored separately; a single aggregate averages over critical failure modes. Fourth, budget-aware — the eval reports tokens consumed, p50/p95 latency, and dollars per query alongside accuracy; a system at 95% by stuffing every past conversation into the prompt is not the same product as one at 92% at 1.8K tokens per query.

What a memory eval is not. It is not RAG evaluation — RAG evals score retrieval against a static corpus; memory evals score retrieval against a corpus the system built itself during ingestion. It is not a needle-in-a-haystack test — NIAH measures whether a long-context model can find a literally-matching fact in its prompt; memory evals measure whether a system can recall a paraphrased fact across sessions.

Intuition

The mental model that pays off: memory evaluation is two-phase by construction, and the phase boundary is where most home-grown evals leak. Phase one is ingest: feed the memory system the full multi-session corpus and let it run its write policy, segmentation, reflection, and compression passes. The system can’t peek at the questions to decide what to retain. Phase two is query: only after ingest does the eval ask questions, scoring responses against ground-truth answers.

The leak that ruins most internal evals: feeding the corpus and the questions together, so the system retains exactly the bits the questions probe. The leak looks like 95% accuracy and production looks like 40%. The public benchmarks all enforce the phase boundary because they were burned by this exact pattern. Get the phase boundary wrong and your numbers are noise.

Three signals separate a real memory eval from a vibes check. First, category breakdown — an agent at 80% single-hop and 6% multi-hop is a different product from one at 50% on both. Second, budget axes alongside accuracy — Mem0’s 92.5% on LoCoMo at ~1.8K tokens per query versus 26K for full-context baselines is three numbers, not one. Third, abstention included — the eval tests whether the system declines when the store doesn’t contain the fact. Without this, the confident-but-wrong failure mode (the agent confabulates a “remembered” preference that was never stated) is invisible in the metric.
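A small worked example makes the first signal concrete. The per-category accuracies are the ones quoted above; the 60/40 question mix is a hypothetical assumption chosen only for illustration.

```python
# Hypothetical question mix: 60 single-hop, 40 multi-hop (illustrative only).
MIX = {"single_hop": 60, "multi_hop": 40}

def aggregate(per_category_acc: dict[str, float], mix: dict[str, int]) -> float:
    """Question-weighted aggregate accuracy across categories."""
    total = sum(mix.values())
    return sum(per_category_acc[c] * n for c, n in mix.items()) / total

agent_a = {"single_hop": 0.80, "multi_hop": 0.06}  # strong recall, broken synthesis
agent_b = {"single_hop": 0.50, "multi_hop": 0.50}  # mediocre at both

print(round(aggregate(agent_a, MIX), 3))
print(round(aggregate(agent_b, MIX), 3))
```

Under this assumed mix the two aggregates land at roughly 0.504 and 0.5, so a single headline number cannot tell the two agents apart even though they fail in very different ways.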

The distributed-systems parallel

The clean analogue is database benchmark suites — TPC-C, TPC-H, YCSB — applied to a system whose contents the benchmark itself populated. The load phase is part of the benchmark (TPC-H measures how fast queries run against the data the load wrote — a memory write-policy that extracts every fact into a graph pays a heavy ingest cost the benchmark either credits or doesn’t). Workload categories matter more than the aggregate (TPC-C’s per-transaction-class throughput is LongMemEval’s per-ability category breakdown). Synthetic-but-representative beats trying to use real data (TPC-C’s transaction mix is not real shopping; LoCoMo’s conversations are LLM-generated and human-verified, not real chats).

The disanalogy: TPC has more than three decades of vendor cooperation calibrating the suite; memory benchmarks are two years old and drift quarterly. Mem0’s LoCoMo numbers went from 66.9% in 2025 to 92.5% in 2026, partly system improvement, partly protocol stabilization. Treat absolute numbers as a moving target; the category breakdown shape is the durable signal.

Mechanics: the public benchmarks

Five entries define the field in 2026: four memory benchmarks and one family of long-context evals. Each probes a different facet. A defensible memory system has numbers on at least the first two; serious teams run all five.

LoCoMo (Maharana et al., 2024). Fifty multi-session conversations between two fictional speakers, each spanning up to 32 sessions and averaging ~600 turns (~16K tokens), with timestamps and multimodal context. The 1,540-question QA set covers five categories: single-hop (intra-session recall), multi-hop (cross-session synthesis), temporal reasoning, open-domain, and adversarial (questions that should be refused). The paper reports GPT-4 at ~32.1 F1 against a human ceiling of ~87.9 — the field’s headline gap. Recent framework numbers: Mem0’s 2026 token-efficient algorithm reports 92.5% on LoCoMo; the 2025 Mem0 paper reports 66.9% vector-only and 68.4% with the Mem0g graph extension, both versus 52.9% for OpenAI’s built-in memory feature. The wide range reflects different LLM judges, prompts, and ingest pipelines — read the protocol before comparing.

LongMemEval (Wu et al., ICLR 2025). Five hundred manually-curated questions embedded in freely-scalable chat histories, probing five memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. The construction is more rigorous than LoCoMo’s — each question is human-authored, with histories scalable to test ingest-time behavior at multiple sizes. The paper documents commercial chat assistants dropping ~30% on sustained interactions versus single-session baselines. Zep’s January 2025 paper reports 71.2% overall on LongMemEval with GPT-4o (versus 60.2% for vanilla full-context at 29s latency), with the largest gains in temporal-reasoning (+17.3pp) and multi-session (+13.6pp). Mem0’s 2026 algorithm reports 94.4% on LongMemEval.

MemoryAgentBench (Hu et al., 2025). Probes four cognitive competencies — accurate retrieval, test-time learning, long-range understanding, and selective forgetting. The selective-forgetting category finally exposed the contradiction-resolution failure mode every framework had been papering over: all paradigms fail dramatically on multi-hop conflict resolution, with best accuracy at or below 6% on CR-MH (Contradiction Resolution, Multi-Hop) even in the most advanced models. Run it if your product depends on the agent updating beliefs as facts change.

BEAM (Mem0’s State of AI Agent Memory 2026). Scales to 1M and 10M token regimes, with ten categories including preference following, knowledge update, event ordering, abstention, and contradiction resolution. Mem0’s 2026 numbers: 64.1 and 48.6 on BEAM 1M/10M. The contradiction-resolution and event-ordering categories are where production systems silently regress as corpora grow.

∞Bench (Zhang et al., 2024), NoLiMa (Modarressi et al., ICML 2025), and RULER (Hsieh et al., 2024) are long-context evals, not memory evals — they measure whether a model can find a fact buried in a 100K+ or 32K+ token prompt, not whether a system can recall across sessions. They calibrate the long-context fallback: NoLiMa’s finding that 11 of 13 tested 128K-claiming models drop below 50% of their short-context baseline at 32K is the empirical case for memory-as-a-system over context stuffing.

A defensible benchmark story: numbers on LoCoMo and LongMemEval for the headline; MemoryAgentBench or BEAM if contradiction and selective forgetting matter; ∞Bench/NoLiMa/RULER numbers to justify why the system uses retrieval rather than bigger context.

Mechanics: the metric framework

Two metric families do the work. Precision/recall/F1 over retrieved memories measures the retrieval substrate — given a query, how many of the top-K retrieved episodes are actually relevant, and how many of the relevant episodes in the store made it into the top-K? End-to-end answer correctness (LLM-as-judge, sometimes string-match against a normalized gold) measures whether the whole system — retrieval plus generation — answers right.

Both matter and fail differently. A retrieval substrate at 90% precision/recall with a generator that fumbles synthesis scores low end-to-end. A substrate that retrieves garbage paired with a generator whose parametric memory carries the answer scores high. The Memory for Autonomous LLM Agents survey (2025) lays out the taxonomy; the production rule is measure both, separately, and pay attention when they diverge.

The precision/recall mechanics, for episodic recall against a known gold:

Top-K retrieval: the system returns K episodes for a query.
Gold-relevant set: the eval pre-computes which episodes are relevant. For LoCoMo this is human-annotated; for a custom eval, you build it.
Precision@K: (relevant retrieved) / K.
Recall@K: (relevant retrieved) / (total relevant in store).
F1@K: harmonic mean of Precision@K and Recall@K.
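The three definitions can be sketched for a single query as follows. The function name and the episode IDs are hypothetical. This version divides precision by K exactly as written above, whereas the harness later in this article divides by the number of distinct episodes returned; the two agree whenever the store returns exactly K distinct episodes.

```python
def precision_recall_f1_at_k(
    retrieved_ids: list[str], relevant_ids: set[str], k: int
) -> tuple[float, float, float]:
    """Precision@K, Recall@K, and F1@K for one query, using K as the precision denominator."""
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & relevant_ids)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical IDs: 5 retrieved, 3 relevant in the store, 2 of the relevant ones retrieved.
p, r, f = precision_recall_f1_at_k(["e1", "e7", "e3", "e9", "e4"], {"e1", "e3", "e8"}, k=5)
print(p, round(r, 3), round(f, 3))
```

Here precision is 2/5, recall is 2/3, and F1 works out to 0.5.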

Pair these with temporal-correctness for the temporal category — “what did the agent believe last Tuesday?” should return the belief that held as of last Tuesday, not the current one. Bi-temporal stores like Graphiti score this category much higher than vector-only stores.

For contradiction resolution, the metric is supersession behavior: when two episodes claim contradictory facts, does the answer reflect the most recent (or most-trusted) claim, or average / pick at random? MemoryAgentBench CR-MH scores this; for a custom eval, the simplest version is hand-built “fact A at session N; ¬A at N+5; question at N+10” tuples.
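The hand-built tuple described above might look like the following sketch. All names (`ContradictionCase`, `build_case`, `score_supersession`) and the sentence templates are hypothetical, and the substring-based scoring is a deliberately crude stand-in for an LLM judge: an answer such as "no longer Vim, now Helix" would be classified as "both".

```python
from dataclasses import dataclass

@dataclass
class ContradictionCase:
    events: list[dict]       # ingest events in session order
    question: str
    gold_answer: str         # the superseding claim
    stale_answer: str        # the superseded claim; answering this is the failure
    question_session: int    # session index at which the question is asked

def build_case(user_id: str, attribute: str, old: str, new: str, n: int) -> ContradictionCase:
    """Fact A at session n, not-A at session n+5, question asked at session n+10."""
    events = [
        {"user_id": user_id, "session": n, "episode_id": f"{user_id}-s{n}",
         "text": f"My {attribute} is {old}."},
        {"user_id": user_id, "session": n + 5, "episode_id": f"{user_id}-s{n + 5}",
         "text": f"Update: my {attribute} is now {new}, not {old}."},
    ]
    return ContradictionCase(events, f"What is the user's {attribute}?", new, old, n + 10)

def score_supersession(answer: str, case: ContradictionCase) -> str:
    """Classify an answer as 'updated', 'stale', 'both', or 'neither' by substring match."""
    a = answer.lower()
    has_new = case.gold_answer.lower() in a
    has_old = case.stale_answer.lower() in a
    if has_new and has_old:
        return "both"
    if has_new:
        return "updated"
    return "stale" if has_old else "neither"

case = build_case("u1", "favorite editor", "Vim", "Helix", n=3)
print(score_supersession("They use Helix.", case))
```

Reporting the four outcomes separately is a design choice: "stale" and "both" correspond to the two regressions described earlier (surfacing the older belief, or storing both), and they call for different fixes.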

Code: a precision/recall harness (Python)

Phase-separated ingest-then-query, category-tagged questions, per-category metrics with token-cost alongside. Store-interchangeable across Mem0, LangGraph store, Letta, or a homegrown vector store.

python

# pip install mem0ai openai  (only for the store and judge you plug in; the harness itself is stdlib-only)
import time
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Literal

Category = Literal["single_hop", "multi_hop", "temporal", "knowledge_update", "abstention"]

@dataclass
class EvalQuestion:
    qid: str
    user_id: str
    question: str
    gold_answer: str | None  # None for abstention category
    relevant_episode_ids: set[str]  # ground truth for retrieval metrics
    category: Category

@dataclass
class IngestEvent:
    user_id: str
    text: str
    episode_id: str
    timestamp: float  # seconds since epoch; preserves multi-session structure

@dataclass
class EvalRun:
    by_category: dict[Category, list[dict]] = field(default_factory=lambda: defaultdict(list))

    def add(self, cat: Category, **fields):
        self.by_category[cat].append(fields)

    def summary(self) -> dict:
        out = {}
        for cat, rows in self.by_category.items():
            if not rows: continue
            n = len(rows)
            out[cat] = {
                "n": n,
                "precision_at_k": sum(r["precision"] for r in rows) / n,
                "recall_at_k": sum(r["recall"] for r in rows) / n,
                "f1_at_k": sum(r["f1"] for r in rows) / n,
                "answer_correct": sum(r["answer_correct"] for r in rows) / n,
                "avg_tokens": sum(r["tokens"] for r in rows) / n,
                "avg_latency_ms": sum(r["latency_ms"] for r in rows) / n,
            }
        return out


def f1(p: float, r: float) -> float:
    return 0.0 if (p + r) == 0 else 2 * p * r / (p + r)


def run_eval(
    ingest_events: list[IngestEvent],
    questions: list[EvalQuestion],
    memory_write: Callable[[IngestEvent], None],
    memory_search: Callable[[str, str, int], list[dict]],     # (user_id, query, k) -> [{episode_id, text, score}]
    answer_with_context: Callable[[str, list[str]], tuple[str, int]],  # (q, retrieved_texts) -> (answer, tokens_used)
    judge_correct: Callable[[str, str | None, Category], bool],  # (pred, gold, category) -> bool
    k: int = 5,
) -> EvalRun:
    # Phase 1: ingest. The memory system has no access to the questions here.
    for ev in sorted(ingest_events, key=lambda e: e.timestamp):
        memory_write(ev)
    # Phase 2: query. Per-question retrieval, scoring, and category attribution.
    run = EvalRun()
    for q in questions:
        t0 = time.perf_counter()
        retrieved = memory_search(q.user_id, q.question, k)
        retrieved_ids = {r["episode_id"] for r in retrieved}
        relevant_retrieved = retrieved_ids & q.relevant_episode_ids
        # Edge: abstention questions have no relevant episodes; recall is 1 by convention,
        # precision is 0 if any episodes were retrieved (system should have abstained on empty).
        if q.category == "abstention":
            precision = 0.0 if retrieved else 1.0
            recall = 1.0
        else:
            precision = (len(relevant_retrieved) / len(retrieved_ids)) if retrieved_ids else 0.0
            recall = (len(relevant_retrieved) / len(q.relevant_episode_ids)) if q.relevant_episode_ids else 0.0
        answer, tokens = answer_with_context(q.question, [r["text"] for r in retrieved])
        # Stop the clock before judging so judge latency is not billed to the memory system.
        latency_ms = (time.perf_counter() - t0) * 1000.0
        correct = judge_correct(answer, q.gold_answer, q.category)
        run.add(
            q.category,
            qid=q.qid,
            precision=precision,
            recall=recall,
            f1=f1(precision, recall),
            answer_correct=1.0 if correct else 0.0,
            tokens=tokens,
            latency_ms=latency_ms,
        )
    return run

The memory_write and memory_search callables are the only store-specific contract, so the same harness scores Mem0, LangGraph, and a homegrown vector store identically; answer_with_context and judge_correct stay fixed across the systems being compared. For abstention questions, the judge should accept any refusal phrasing as correct when the gold is None. EvalRun.summary() is the publishable artifact: per-category precision, recall, F1, end-to-end correctness, and token/latency budget in one dict.

Code: the same harness in TypeScript

typescript

// npm i ai @ai-sdk/openai  (judge call wires into generateText from the Vercel AI SDK)
type Category = "single_hop" | "multi_hop" | "temporal" | "knowledge_update" | "abstention";

interface EvalQuestion {
  qid: string; userId: string; question: string;
  goldAnswer: string | null; relevantEpisodeIds: Set<string>; category: Category;
}
interface IngestEvent { userId: string; text: string; episodeId: string; timestamp: number; }
interface RetrievedEpisode { episodeId: string; text: string; score: number; }

const f1 = (p: number, r: number) => (p + r === 0 ? 0 : (2 * p * r) / (p + r));

export async function runEval(args: {
  ingestEvents: IngestEvent[]; questions: EvalQuestion[];
  memoryWrite: (ev: IngestEvent) => Promise<void>;
  memorySearch: (userId: string, query: string, k: number) => Promise<RetrievedEpisode[]>;
  answerWithContext: (q: string, retrieved: string[]) => Promise<{ answer: string; tokens: number }>;
  judgeCorrect: (pred: string, gold: string | null, cat: Category) => Promise<boolean>;
  k?: number;
}) {
  const k = args.k ?? 5;
  // Phase 1: ingest in timestamp order — preserves multi-session structure.
  for (const ev of [...args.ingestEvents].sort((a, b) => a.timestamp - b.timestamp)) {
    await args.memoryWrite(ev);
  }
  // Phase 2: per-question retrieval + scoring.
  const rows = new Map<Category, any[]>();
  for (const q of args.questions) {
    const t0 = performance.now();
    const retrieved = await args.memorySearch(q.userId, q.question, k);
    const ids = new Set(retrieved.map((r) => r.episodeId));
    const hits = new Set([...ids].filter((id) => q.relevantEpisodeIds.has(id)));
    const precision = q.category === "abstention"
      ? (retrieved.length === 0 ? 1 : 0)
      : (ids.size ? hits.size / ids.size : 0);
    const recall = q.category === "abstention"
      ? 1
      : (q.relevantEpisodeIds.size ? hits.size / q.relevantEpisodeIds.size : 0);
    const { answer, tokens } = await args.answerWithContext(q.question, retrieved.map((r) => r.text));
    // Stop the clock before judging so judge latency is not billed to the memory system.
    const latencyMs = performance.now() - t0;
    const correct = await args.judgeCorrect(answer, q.goldAnswer, q.category);
    (rows.get(q.category) ?? rows.set(q.category, []).get(q.category)!).push({
      precision, recall, f1: f1(precision, recall),
      answerCorrect: correct ? 1 : 0, tokens, latencyMs,
    });
  }
  // Aggregate per category.
  const out: any = {};
  for (const [cat, r] of rows) {
    const avg = (key: string) => r.reduce((a, x) => a + x[key], 0) / r.length;  // `key` avoids shadowing the retrieval k
    out[cat] = { n: r.length, precisionAtK: avg("precision"), recallAtK: avg("recall"),
                 f1AtK: avg("f1"), answerCorrect: avg("answerCorrect"),
                 avgTokens: avg("tokens"), avgLatencyMs: avg("latencyMs") };
  }
  return out;
}

Same metrics, same phase boundary, same abstention edge case, same per-category aggregation as the Python version. The judge plugs into generateText from the Vercel AI SDK or equivalent.

Designing a custom eval

Public benchmarks are necessary, not sufficient. They probe generic conversational memory; they don’t probe your product’s recall patterns. A customer-support agent has return-policy preferences and account-tier context that LoCoMo doesn’t touch; a medical-history agent has drug interactions and prior diagnoses that LongMemEval doesn’t touch. The custom eval is where workload-specific failure modes show up.

The minimum shape:

20-50 real or near-real multi-session traces. Real if privacy permits; otherwise synthetic in the product’s actual conversational shape. Curated 20-50 beats lazy 500.
Per-trace question set, 10-20 questions. Tagged by the LongMemEval five-ability frame. Each has a gold answer (or None for abstention) and relevant-episode IDs for retrieval metrics.
An ingest-then-query harness. The Python/TS code above is the template.
A weekly cron run plus a pre-ship gate. The cron run catches drift between changes; every memory-policy change also passes the eval before shipping. A per-category regression is blocking even if the aggregate holds.
A growing adversarial subset. Every production incident that traced to a memory bug becomes a regression question. After six months this set is more valuable than the seed corpus.
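The "per-category regression is blocking" rule from the list above could be enforced with a gate like this sketch. The function name, the two-point tolerance, and the sample numbers are assumptions; the metric keys follow the summary dict produced by the Python harness earlier in this article.

```python
def find_regressions(
    baseline: dict[str, dict[str, float]],
    candidate: dict[str, dict[str, float]],
    metrics: tuple[str, ...] = ("recall_at_k", "answer_correct"),
    tolerance: float = 0.02,
) -> list[str]:
    """List every (category, metric) that dropped by more than `tolerance` versus baseline."""
    failures = []
    for cat, base_row in baseline.items():
        cand_row = candidate.get(cat)
        if cand_row is None:
            failures.append(f"{cat}: category missing from candidate run")
            continue
        for m in metrics:
            if base_row[m] - cand_row[m] > tolerance:
                failures.append(f"{cat}.{m}: {base_row[m]:.3f} -> {cand_row[m]:.3f}")
    return failures

# Hypothetical summaries: single-hop improves while knowledge-update answers regress.
baseline = {
    "single_hop": {"recall_at_k": 0.80, "answer_correct": 0.78},
    "knowledge_update": {"recall_at_k": 0.70, "answer_correct": 0.66},
}
candidate = {
    "single_hop": {"recall_at_k": 0.88, "answer_correct": 0.86},
    "knowledge_update": {"recall_at_k": 0.69, "answer_correct": 0.52},
}
failures = find_regressions(baseline, candidate)
print(failures)
```

In this example the aggregate barely moves, yet the gate reports the knowledge_update drop in answer_correct. A CI job would exit non-zero whenever the returned list is non-empty.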

The trap to avoid: scoring only end-to-end answer correctness and ignoring retrieval precision/recall. If retrieval is at 30% recall but the generator’s parametric knowledge masks it for 80% of questions, the system will fail catastrophically the moment the question distribution shifts. Retrieval metrics are the leading indicator; end-to-end is the lagging one.
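One way to surface that masking is to flag questions answered correctly despite poor retrieval. The sketch below assumes per-question rows with the qid, recall, and answer_correct fields that the Python harness records; the function names and the 0.5 recall floor are hypothetical.

```python
def masked_failures(rows: list[dict], recall_floor: float = 0.5) -> list[str]:
    """Question IDs answered correctly even though retrieval recall fell below the floor."""
    return [
        r["qid"] for r in rows
        if r["answer_correct"] == 1.0 and r["recall"] < recall_floor
    ]

def masking_rate(rows: list[dict], recall_floor: float = 0.5) -> float:
    """Share of correct answers that retrieval did not actually support."""
    correct = [r for r in rows if r["answer_correct"] == 1.0]
    if not correct:
        return 0.0
    return len(masked_failures(rows, recall_floor)) / len(correct)

# Hypothetical per-question rows.
rows = [
    {"qid": "q1", "recall": 1.0, "answer_correct": 1.0},
    {"qid": "q2", "recall": 0.0, "answer_correct": 1.0},  # right answer, nothing relevant retrieved
    {"qid": "q3", "recall": 0.0, "answer_correct": 0.0},
    {"qid": "q4", "recall": 0.5, "answer_correct": 1.0},
]
print(masked_failures(rows), masking_rate(rows))
```

Here only q2 is flagged, giving a masking rate of one in three correct answers. A rising masking rate suggests the end-to-end score is leaning on the generator rather than the memory store.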

For Ragas-style metrics applied to memory: context precision and context recall translate directly, and faithfulness catches “generator confabulates a memory.” The multi-session-specific categories (temporal correctness, contradiction resolution) diverge; Ragas doesn’t have those out of the box.

Trade-offs, failure modes, gotchas

LLM-as-judge bias. Judges have position bias, verbosity bias, and judge-model bias (a GPT-4 judge scores GPT-4 answers slightly higher than Claude’s, and vice versa). Use the same judge across systems being compared and report the judge alongside the score. Mem0’s 92.5% and Zep’s 71.2% are not directly comparable until you confirm both used the same judge and prompt. The LLM-as-judge article is the deep dive on each of these bias modes and the mitigations (both-orderings, cross-family panels, calibration against a human reference); the Memory for Autonomous LLM Agents survey is the memory-specific reference.

Contradiction-resolution silent regression. MemoryAgentBench’s ≤6% on multi-hop conflict resolution is the canary. Production systems miss it because contradictions are rare in demos; they show up at the multi-month timescale. Include a synthetic contradiction subset in the custom eval and re-evaluate after every write-policy change. The conflict article names the supersession-versus-deletion distinction; the eval verifies it actually happened.

The “we beat the benchmark” hill-climb. Once a benchmark exists, frameworks optimize for it. Mem0’s 66.9% → 92.5% over a year is partly improvement, partly hill-climbing. Category-aware reporting (improving single-hop while regressing multi-hop is not better) and custom-eval discipline are the mitigations.

Ingest-cost-versus-query-cost asymmetry. Frameworks that pay heavily at write time (graph extraction, reflections) score well on multi-hop and have cheap per-query budgets — the trade Mem0 and Zep both advertise. But ingest cost is often invisible in the eval. Custom evals should report total cost (ingest + query) per user-month.
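The per-user-month total is simple arithmetic, sketched below. The 1.8K and 26K tokens-per-query figures are the ones quoted in this article; everything else (6,000 ingest tokens per episode, 200 episodes and 300 queries per month, a single blended price of $1 per million tokens) is a hypothetical assumption.

```python
def cost_per_user_month(
    ingest_tokens_per_episode: float,
    episodes_per_month: float,
    query_tokens_per_query: float,
    queries_per_month: float,
    usd_per_million_tokens: float,
) -> dict[str, float]:
    """Split monthly token spend for one user into ingest, query, and total dollars."""
    rate = usd_per_million_tokens / 1_000_000
    ingest = ingest_tokens_per_episode * episodes_per_month * rate
    query = query_tokens_per_query * queries_per_month * rate
    return {"ingest_usd": ingest, "query_usd": query, "total_usd": ingest + query}

# Hypothetical workload: 200 episodes and 300 queries per user per month.
extract_heavy = cost_per_user_month(6_000, 200, 1_800, 300, 1.0)
full_context = cost_per_user_month(0, 200, 26_000, 300, 1.0)
print(extract_heavy)
print(full_context)
```

Under these assumptions the extraction-heavy system spends about $1.20 on ingest and $0.54 on queries, against about $7.80 for the full-context baseline. Reporting only the query line would overstate the gap, and the ratio shifts as the episode-to-query mix changes.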

Protocol drift. Two teams running “LoCoMo” with different judges, prompts, and ingest pipelines are reporting non-comparable numbers. Publish the protocol alongside the score: judge model, judge prompt, system prompt, ingest chunking, retrieval K, timestamps.
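A lightweight way to publish the protocol is a manifest with a fingerprint attached to every reported score. This is a sketch under assumptions: `EvalProtocol` and its field names are hypothetical, the model names and prompt strings are placeholders, and a run timestamp would be logged next to the fingerprint rather than hashed into it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EvalProtocol:
    benchmark: str
    judge_model: str
    judge_prompt: str
    system_prompt: str
    ingest_chunking: str
    retrieval_k: int

    def fingerprint(self) -> str:
        """Stable short hash of the protocol; equal fingerprints mean equal settings."""
        canonical = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Placeholder values; the two protocols differ only in the judge model.
ours = EvalProtocol("LoCoMo", "judge-model-a", "<judge prompt>", "<system prompt>", "per-turn", 5)
theirs = EvalProtocol("LoCoMo", "judge-model-b", "<judge prompt>", "<system prompt>", "per-turn", 5)
print(ours.fingerprint() == theirs.fingerprint())
```

The comparison prints False: both runs are nominally "LoCoMo", but the fingerprints differ, which signals that the two scores should not be compared directly.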

Context-rot interaction with long-context fallback. Chroma’s context rot experiments and NoLiMa both document non-uniform degradation as input length grows. A memory system at 1.8K tokens per query may outperform a long-context system at 26K even with the same information, simply because the model uses the small context better. Measure both on the same eval and the trade-off becomes the actual decision.

The “we ran it once” gotcha. The eval is a regression suite. Every PR touching the write policy, segmentation, reflection prompt, or retrieval policy runs it with category-aware thresholds. Inspect AI and Promptfoo are useful infrastructure; the policy is the same benchmark-regression-on-PR check that mature data systems already run.

What to read next

Production Memory Frameworks: MemGPT/Letta, mem0, Zep, Graphiti — the capstone the eval discipline calibrates. The 2026 benchmark numbers each framework publishes (Mem0’s 94.4% LongMemEval, Zep’s 71.2%, the BEAM 1M/10M results) only make sense against the per-category breakdown and protocol-drift discussion in this article.
Eval-Driven Development for LLM Systems — the general workflow this article specializes for memory. The error-analysis-first loop, golden-set construction, and the test-pyramid layering apply to every LLM application; this piece is the memory specialization of that broader practice.
LLM-as-Judge: Pointwise and Pairwise — the deep dive on the judge that powers every cross-framework memory benchmark in this article. The Mem0/Zep/MemoryAgentBench numbers all sit on top of a judge whose choice changes the result; this piece walks the rubric design, bias modes, and human-calibration loop.
RAG Evaluation: Recall, Faithfulness, Answer Quality — the sibling discipline. Memory eval inherits retrieval-substrate metrics and generation-side metrics from RAG eval; multi-session-specific categories (temporal, contradiction, abstention) are where the two diverge.