Temporal Reasoning and Memory Provenance
Temporal reasoning and provenance in agent memory: as-of queries, bi-temporal validity, dated claims, staleness gates, and per-fact source audit trails.
A finance research agent answers a question on a Tuesday afternoon: “What’s our position on the Acme deal?” It pulls back a confident summary — the user is bullish, the model is the v3 forecast, the expected close date is end of Q2. Every claim is grounded in a retrieved episode; every retrieval ranked well by the recency-importance-similarity blend. The user reads the summary, frowns, and types back: “That was Q1. We pivoted in April. Where are you getting this?” The agent has no answer. The episodes it retrieved were genuinely the highest-scoring ones in the store — they just happened to be from two months ago, before the pivot, and nothing in the retrieval pass knew the difference. Worse, the model can’t say which episodes its claims came from, because the surface representation in the prompt was a flat list of text blobs with no back-pointers. The user closes the session and the team gets a Slack message asking why the AI keeps citing stale information with a straight face. The bug is not in the retrieval policy. The bug is the absence of two adjacent disciplines the retrieval policy quietly assumed someone else was handling: temporal reasoning (“when was this true?”) and provenance (“where did this come from?”). This article is the deep dive on both.
Opening bridge
Yesterday’s piece closed the read path of the memory subsystem with the Generative Agents α·recency + β·importance + γ·similarity rerank and the cache-replacement parallel. The piece ended with a list of failure modes, two of which are the seeds of this article. First, the temporal-query blindness: a static retrieval policy treats “what did we decide last month?” and “what should we decide now?” the same way, and the recency term pushes both queries toward the same recent episodes regardless of the asked-about date. Second, the staleness gate: some content has explicit validity intervals and should be filtered out (not merely down-weighted) when those intervals don’t cover the query’s reference time. The retrieval-policies piece named both as gotchas and deferred the deep dive. This is the deep dive. Today’s article also reaches back to the knowledge graphs piece, which introduced the bi-temporal model — valid_time vs transaction_time, the two-clock pattern temporal databases have shipped for decades — as a property of edges in a structured memory graph. That piece worked the write side of the bi-temporal model: how an edge gets its validity stamps when it’s created, how contradiction detection invalidates it without deletion. This piece works the read side: how to actually use those two clocks at query time, how to attribute remembered facts back to source episodes, and how to make “why does the agent believe X?” a structured query rather than a debugging guess.
Definition
Temporal reasoning over agent memory is the discipline of answering questions whose correct answer depends on when — both when the asked-about fact was true and when the system learned it. Provenance is the discipline of attributing every remembered claim back to the source episode (or chain of episodes) that produced it, with enough metadata that the chain can be audited or revoked. The two are siblings: temporal reasoning answers “when,” provenance answers “why we think so,” and a memory subsystem that does either one without the other degrades in predictable, costly ways.
Five operations a temporally-aware retrieval has to ground out. First, parse the temporal intent of the query — does the query carry an explicit reference time (“as of March,” “yesterday,” “three months ago,” “before the migration”), an implicit one (a question about a fact that has known turnover, like a job title or a forecast model version), or no temporal constraint at all? Second, filter candidates by validity — drop episodes whose valid_from/valid_to interval doesn’t cover the query’s reference time. Third, score by staleness, not just recency — recency favors the newest episode regardless of whether the asked-about thing has changed since; staleness asks “is this fact still believed to be true as of the query’s reference time?” Fourth, attach provenance to every returned claim — when the read path returns a fact, it returns the episode IDs that produced it, with enough metadata to walk the chain back to source. Fifth, handle disagreements explicitly — when two episodes contradict each other within the same valid window, the policy must either return both with explicit divergence flags or pick one with a recorded resolution rule, not silently quote whichever ranked higher.
What temporal reasoning is not. It is not recency-weighted retrieval (recency is a signal; temporal reasoning is the frame — recency answers “how new is this episode,” temporal reasoning answers “was this fact true at the asked-about time”). It is not the write-side bi-temporal model (that’s the data shape; this is the query-side discipline that uses it). It is not reflection (reflection writes higher-order claims; provenance tracks where they came from). It is not audit logging (audit logs are the journal of operations; provenance is the structured back-pointer chain of beliefs). All four interact; the temporal-reasoning piece is the layer above them, the one that turns the substrate into something an agent can defensibly answer a dated question against.
Intuition
The mental model that pays off is bitemporal databases applied to a memory subsystem. Temporal databases — the academic literature stretches back to Snodgrass and Ahn’s 1985 paper and the SQL:2011 standard codified the syntax — separate two clocks that hand-rolled “we added a created_at column” systems silently conflate. The clocks:
Valid time: when the fact was true in the world. “Priya was the user’s manager from 2025-09-15 through 2026-04-01.” This is the clock that answers point-in-time queries: “who was the user’s manager as of 2026-03-15?”
Transaction time: when the system learned the fact. “We ingested the ‘Priya stopped being manager on 2026-04-01’ update on 2026-04-15.” This is the clock that answers system-belief queries: “what did the agent think on 2026-04-05?” The answer is “Priya was still the user’s manager,” because the system didn’t know about the change until the 15th.
The two clocks separate because corrections happen out of order. The user might tell the agent on April 15 that Priya stopped being their manager on April 1; the world-time fact ends on the 1st, but the system only knew about it on the 15th. Single-clock systems collapse the two and answer one query incorrectly: either they overwrite history (no audit trail for what the system used to believe) or they freeze the system’s belief (no way to learn that the world has changed). Two-clock systems answer both queries correctly. SQL:2011 standardized the building blocks: system-versioned tables queried with FOR SYSTEM_TIME AS OF, plus application-time period tables for valid time (FOR BUSINESS_TIME AS OF is IBM Db2’s spelling of the valid-time query). Engine support varies: MariaDB and SQL Server ship system-versioned tables natively, while PostgreSQL has typically relied on extensions.
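The Priya example can be worked through in a few lines. This is a minimal sketch, not any framework's data model: the `FactVersion` record, the `as_of` helper, and the half-open `[from, to)` interval convention are all assumptions made for illustration. Each correction appends a new version and closes the transaction-time window of the old one, so both the historical-belief query and the current-belief query stay answerable.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass(frozen=True)
class FactVersion:
    subject: str
    predicate: str
    value: str
    valid_from: date            # valid time: when the fact became true in the world
    valid_to: Optional[date]    # None = still true as far as this version knows
    tx_from: date               # transaction time: when the system recorded this version
    tx_to: Optional[date]       # None = this is the system's current version

def as_of(versions: List[FactVersion], *, valid_at: date, known_at: date) -> List[FactVersion]:
    """Versions true at `valid_at`, according to what the system knew at `known_at`."""
    hits = []
    for v in versions:
        known = v.tx_from <= known_at and (v.tx_to is None or known_at < v.tx_to)
        true_then = v.valid_from <= valid_at and (v.valid_to is None or valid_at < v.valid_to)
        if known and true_then:
            hits.append(v)
    return hits

HISTORY = [
    # Original belief: open-ended validity, replaced in the system on April 15.
    FactVersion("user", "manager", "Priya", date(2025, 9, 15), None,
                date(2025, 9, 15), date(2026, 4, 15)),
    # Correction ingested April 15: the relationship actually ended April 1.
    FactVersion("user", "manager", "Priya", date(2025, 9, 15), date(2026, 4, 1),
                date(2026, 4, 15), None),
]

# What did the agent think on April 5 about April 5? Still Priya.
belief_then = as_of(HISTORY, valid_at=date(2026, 4, 5), known_at=date(2026, 4, 5))
# What do we now believe about April 5? No manager fact covers that date.
belief_now = as_of(HISTORY, valid_at=date(2026, 4, 5), known_at=date(2026, 5, 1))
# What do we now believe about March 15? Priya, via the corrected version.
world_march = as_of(HISTORY, valid_at=date(2026, 3, 15), known_at=date(2026, 5, 1))
```

A single `created_at` column cannot distinguish `belief_then` from `belief_now`; the second pair of timestamps is what keeps the superseded belief queryable.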
The memory subsystem inherits the same structure for the same reason. Every remembered fact ought to carry both clocks; every retrieval ought to take a reference time and filter by both. The agent-memory port adds a third dimension the temporal-database literature doesn’t have to worry about: the evidentiary clock — when the fact was last verified by the agent or the user, separate from when it was first ingested. A fact ingested in March and re-confirmed in May is more trustworthy than a fact ingested in March and never mentioned since, even if both have the same created_at. The shape of this three-clock model — (valid_time, transaction_time, last_verified) — is what production memory frameworks like Zep/Graphiti implement and what most hand-rolled memory systems quietly miss.
The provenance side of the model is conceptually simpler and operationally just as important: every fact has a source_episode_ids field that lists the raw episodes the fact was derived from. A reflection that says “the user is vegetarian” carries the four episodes that mentioned chicken-and-rice avoidance, lentil dishes, and “I don’t eat meat.” When the user later says “actually, I started eating fish last month,” the agent can walk the provenance chain back to the four sources, mark the reflection as superseded, and write a new one that reconciles. The chain isn’t an audit nicety — it’s the only mechanism a learned memory system has for handling contradictions without losing the ability to explain itself.
The distributed-systems parallel — bitemporal tables and append-only ledgers
Three parallels, each load-bearing.
Bitemporal SQL tables are the structural analogue for two-clock agent memory. The SQL:2011 temporal extensions define SYSTEM_TIME (transaction time, automatically managed by the DBMS) and application-defined period columns for valid time. A bitemporal query SELECT * FROM employees FOR SYSTEM_TIME AS OF '2026-04-05' WHERE id = 42 AND business_time CONTAINS '2026-03-15' returns the row as the system understood it on April 5, for the fact as it applied to March 15 — two independent filters, both required for correctness. The agent-memory port is the same: a memory query retrieve(query, as_of_business=date_in_question, as_of_system=date_in_question_or_now) filters by both clocks. The Zep paper’s retrieval API takes a valid_at parameter for exactly this reason; production frameworks that don’t expose the second clock force the user to either accept “current system belief” semantics (no historical-belief queries) or hand-roll the filter in application code.
The provenance chain is an append-only ledger with cryptographic hash references. The pattern is borrowed almost directly from event-sourced systems and append-only ledgers (Git’s commit graph, Kafka with retained offsets, blockchain merkle trees). Every higher-order claim — a reflection, a semantic fact, a summary — points back to the lower-order entries that produced it, by ID and ideally by content hash. When a source episode is corrected, the dependent claims can be walked forward (the inverse of the back-pointer) and either revalidated or marked stale. Without the chain, the only way to revalidate a derived claim is to re-derive it from scratch, which scales poorly. With the chain, contradiction propagation is a graph traversal — touch the source, walk the dependents, mark or recompute. The event-sourcing literature (Greg Young’s foundational write-up, Fowler’s elaborations) is the cleanest reference for the pattern, and the agent-memory variant is a near-direct port.
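To make the "by ID and ideally by content hash" idea concrete, here is a small sketch using only the standard library. The `episodes` dict, `make_claim`, and `stale_sources` are hypothetical names; the point is only that recording a hash next to each back-pointer lets a later pass detect that a source changed after the claim was derived.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Toy episode store: id -> raw text.
episodes = {
    "ep1": "I don't eat meat.",
    "ep2": "Ordered the lentil dish again.",
}

def make_claim(text, source_ids):
    # Each back-pointer records the source ID plus a hash of the content
    # the claim was derived from at write time.
    return {
        "text": text,
        "sources": [(sid, content_hash(episodes[sid])) for sid in source_ids],
    }

def stale_sources(claim):
    """IDs of sources that are missing or whose content no longer matches the recorded hash."""
    return [
        sid for sid, recorded in claim["sources"]
        if sid not in episodes or content_hash(episodes[sid]) != recorded
    ]

claim = make_claim("The user is vegetarian.", ["ep1", "ep2"])
before = stale_sources(claim)               # nothing has changed yet
episodes["ep1"] = "I don't eat red meat."   # source corrected after derivation
after = stale_sources(claim)                # the claim now needs revalidation
```

In an append-only store the correction would land as a new episode rather than an in-place edit; the in-place mutation here just keeps the example short.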
The staleness gate is a circuit breaker on per-fact validity, not on per-episode freshness. The circuit-breaker pattern from distributed systems wraps a potentially-failing call with a fast-fail check that trips when the call’s recent behavior crosses a threshold. The memory-side analogue: each fact’s validity interval is the circuit’s closed/open state. When the query’s reference time falls outside the interval, the gate trips (the fact is excluded entirely, not merely scored lower); when the time is in the interval, the gate stays closed and the fact participates in retrieval normally. The reason this is a gate rather than a score is the same reason a tripped circuit breaker doesn’t return slightly-slower results — a stale fact is worse than no fact, because it confidently misleads, and the right behavior is to refuse to surface it at all. Production frameworks that conflate “old” with “stale” (down-weighting by recency without an explicit validity check) silently surface contradicted facts and the user notices first.
The as-of query — temporal intent classification
The single highest-leverage piece of the temporal-reasoning stack is the intent classifier on the query: a lightweight pass (regex, small-model call, or both) that extracts the query’s reference time before the retrieval runs. Five intent shapes worth recognizing:
Explicit absolute time — “what did we decide on March 15?” The reference time is 2026-03-15. The retrieval filters by valid_from ≤ 2026-03-15 ≤ valid_to, drops the recency term entirely (α = 0), and may also filter by transaction_time ≤ as-of-date if the user is asking about historical system belief rather than historical world state.
Explicit relative time: “what did we decide three months ago?” Resolve to absolute (now - 90d) and proceed as above. The trap is time-zone dependence: “yesterday” at 23:55 means a different date in UTC than in PST, so the classifier should resolve relative times against the user’s time zone or fail explicitly. The Test of Time benchmark (Fatemi et al., 2024) measures exactly this kind of relative-time resolution and finds that LLMs lose 23-35% accuracy when shifting from “in 2020” to “4 years ago,” even though the two refer to the same absolute date. The classifier carries non-trivial difficulty and should be measured.
Implicit temporal intent — “what’s the user’s job?” carries no explicit time but reads against a fact that has known turnover. The classifier needs to recognize the fact category (job, manager, address, preferences-known-to-change) and resolve the reference time to “current” — which means the filter is valid_from ≤ now ≤ valid_to, dropping any fact whose valid_to is in the past. Without the classifier, an old “user works at Acme” fact and the current “user works at Beta Corp” fact both retrieve, and the rerank picks whichever has higher cosine similarity to the query.
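Before/after queries: “what did we believe before the pivot?” requires identifying the pivot event in the timeline and using its timestamp as the reference point. The pivot itself is typically an episode in the store; the classifier walks back to find the canonical episode the user is referencing, then resolves “before the pivot” to the facts that were valid immediately before it: valid_from < pivot_episode.ts, with valid_to either null or ≥ pivot_episode.ts. Filtering on valid_to < pivot_episode.ts instead would keep only facts that had already expired before the pivot, which is rarely what the question means.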
No temporal intent — most queries fall here. The default is “as of now,” and the temporal filter is permissive (valid_from ≤ now ≤ valid_to) but doesn’t override the rerank’s recency weighting. The query proceeds through normal retrieval; the temporal filter just prunes facts whose validity windows have explicitly expired.
The implementation is a tiered classifier: cheap regex catches explicit dates and relative-time phrases; a small-model call (Haiku-class, ~50ms) catches the implicit and before/after cases; if both miss, fall through to “no temporal intent.” Mem0’s temporal-intent flag and Zep’s valid_at parameter both expose the resolved reference time to downstream retrieval, and both report measurable wins on temporal-RAG benchmarks (the ChronoQA and TEMPRAGEVAL datasets are the published yardsticks).
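A minimal sketch of the regex tier, assuming ISO-formatted dates, a handful of English relative phrases, and a coarse 30-day month. The function name `classify`, the intent labels, and the patterns are illustrative, not any framework's API; the small-model tier is left as a comment. The time zone is an explicit argument so that "yesterday" is resolved against the user's calendar day rather than the server's.

```python
import re
from datetime import date, datetime, timedelta, timezone, tzinfo
from typing import Tuple

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
RELATIVE = re.compile(
    r"\b(\d+|one|two|three|four|five|six)\s+(day|week|month)s?\s+ago\b", re.I)
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}  # coarse: 30-day months

def classify(query: str, now: datetime, user_tz: tzinfo) -> Tuple[str, date]:
    """Return (intent, reference_date) using the cheap regex tier only."""
    today = now.astimezone(user_tz).date()   # the user's calendar day, not the server's
    m = ISO_DATE.search(query)
    if m:
        return "explicit_absolute", date(*map(int, m.groups()))
    if re.search(r"\byesterday\b", query, re.I):
        return "explicit_relative", today - timedelta(days=1)
    m = RELATIVE.search(query)
    if m:
        raw, unit = m.group(1).lower(), m.group(2).lower()
        n = WORDS.get(raw) or int(raw)
        return "explicit_relative", today - timedelta(days=n * DAYS_PER_UNIT[unit])
    # Tier 2 (a small-model call for implicit and before/after intents) would go here.
    return "none", today

# 07:30 UTC on May 12 is still 23:30 on May 11 at UTC-8.
now = datetime(2026, 5, 12, 7, 30, tzinfo=timezone.utc)
pst = timezone(timedelta(hours=-8))
in_pst = classify("what happened yesterday?", now, pst)
in_utc = classify("what happened yesterday?", now, timezone.utc)
```

The two `yesterday` results differ by a day, which is the failure the time-zone argument exists to prevent.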
Staleness as a fourth retrieval signal
The retrieval-policies article blended three signals: recency, importance, similarity. Temporal-aware retrieval adds a fourth: staleness, the inverse of “the fact is believed to still be true as of the query’s reference time.”
The signal is binary in the strict version (stale or not) and continuous in the soft version (probability the fact is still current). The strict version is a gate: a stale fact gets score = 0 regardless of other signals. The soft version is a down-weight: a fact whose validity is about to expire (or whose category has high turnover) gets multiplied by a freshness ∈ [0, 1] factor. Both are defensible; the right choice depends on the fact category.
Strict gating for tagged-with-validity facts. Any fact with an explicit valid_to timestamp is gated: if valid_to < query_reference_time, exclude. The knowledge-graph world has this for free — every edge carries the interval. The vector-memory world has to add it as metadata on writes, which most retrofitted systems don’t do, which is why the staleness failure mode is more visible in vector-only stacks than in graph-augmented ones.
Soft down-weighting for category-typed facts without explicit validity. A fact tagged as "job_title" has no valid_to written at ingest time, but the category has a known half-life: average job tenure is roughly 4 years. A "food_preference" fact has a half-life closer to 5-10 years. A "name" fact (the user’s own name) is effectively immortal. The down-weight is 0.5^(Δt / category_half_life), equivalently exp(-ln 2 · Δt / category_half_life), so a fact exactly one half-life old scores 0.5. This is the same shape as the recency decay but tuned per category instead of globally. The signal is empirically powerful: the temporal-validity literature finds that LLM-based fact-retrieval systems with category-conditional staleness gates outperform uniform-decay systems by 6-15% on temporal benchmarks.
The verification boost. A fact that’s been re-confirmed by the user recently (the last_verified clock) gets a freshness boost: freshness = min(1, freshness · (1 + δ · 1[recently_verified])), with the clamp keeping the factor inside [0, 1]. The signal captures the behavioral component of staleness: a fact the user re-mentioned last week is current regardless of when it was first ingested. The mechanic shows up in MemoryBank’s read-driven salience update (the Ebbinghaus-curve-inspired pattern from the retrieval-policies piece) and applies to staleness for the same reason it applies to importance: reactivation is evidence.
The contradiction penalty. A fact that has been explicitly contradicted by a later episode — the user said “I switched stacks” after the agent previously believed “user uses Postgres” — gets freshness = 0 even if its valid_to is null. The contradiction-detection step is itself an LLM call (pattern-match for “X used to be Y but is now Z” or “actually, X” or “we pivoted on X”); the Zep/Graphiti contradiction-handling pipeline is the cleanest documented example, and the memory conflict and forgetting article works the full ADD/UPDATE/DELETE/NOOP resolver pattern in detail. The mechanic is the agent-memory port of the stamp-the-row-as-deleted-but-keep-it pattern from event-sourced systems: the contradicted fact stays in the store (for audit) but is excluded from retrieval.
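The three soft signals above (category decay, verification boost, contradiction penalty) compose into a single factor. The sketch below is one way to write it, with assumptions stated plainly: the half-life table holds illustrative values rather than measured ones, `VERIFY_BOOST` stands in for δ, and the decay uses the true half-life form 0.5^(Δt / h), which differs from exp(-Δt / h) only by a factor of ln 2 in the exponent.

```python
import math

# Illustrative values only; real half-lives should be measured per store.
HALF_LIFE_DAYS = {
    "job_title": 4 * 365,
    "food_preference": 7 * 365,
    "name": math.inf,          # effectively immortal
}
DEFAULT_HALF_LIFE_DAYS = 365
VERIFY_BOOST = 0.25            # plays the role of δ in the text

def freshness(category, age_days, *, recently_verified=False, contradicted=False):
    """Soft freshness factor in [0, 1] for a fact with no explicit valid_to."""
    if contradicted:
        return 0.0                             # contradiction penalty overrides everything
    half_life = HALF_LIFE_DAYS.get(category, DEFAULT_HALF_LIFE_DAYS)
    factor = 0.5 ** (age_days / half_life)     # infinite half-life gives 0.5 ** 0 == 1.0
    if recently_verified:
        factor *= 1 + VERIFY_BOOST             # reactivation is evidence
    return min(1.0, factor)                    # clamp so the factor stays in [0, 1]
```

For example, `freshness("job_title", 4 * 365)` sits exactly at the half-life, and the same fact re-confirmed last week gets lifted by the boost without ever exceeding 1.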
The combined scoring formula:
score = staleness_gate × freshness_factor × (α · recency + β · importance + γ · similarity)
Where staleness_gate is binary (0 or 1) and freshness_factor ∈ [0, 1] captures the soft signals. The multiplicative structure is intentional: a stale fact (gate = 0) is excluded regardless of the rest, while a borderline-stale fact (freshness = 0.3) is down-weighted but still retrievable if the other signals are strong enough.
Provenance — the back-pointer chain
The provenance chain is structurally an inverse-index over the writes: every higher-order claim carries a source_episode_ids: [...] field, and every source episode carries a (computed-on-demand) list of dependent claims via reverse lookup. Three operations the chain has to support:
Walk-down: claim → sources. Given a returned fact, list the episodes that produced it. The chain is read at answer-rendering time — when the agent quotes “the user is vegetarian,” the harness can append “(based on episodes from March 3, March 17, and April 2)” or expose the source IDs through an API for downstream UI. The shape is exactly the citation pattern every production RAG system already ships, ported to memory: don’t return a fact without its sources.
Walk-up: source → dependents. Given an episode that’s been corrected or invalidated, find the higher-order claims that derived from it. The chain is read at write time when a contradiction lands — the contradiction-detection step marks the new episode as superseding the old, and the propagation pass walks every reflection/summary that cited the old episode and either revalidates or marks-as-stale each one. Without the walk-up, contradictions are content-local — the old episode is stamped invalid, but the reflection built on it is still confidently citable in retrieval. The chain is what makes contradictions systemic.
Walk-graph: claim → claim. A reflection cites episodes; a meta-reflection cites reflections; the chain can be arbitrarily deep. The walk-graph operation is a transitive closure over the back-pointer graph and gets called when the agent is asked “explain why you believe this” or when a debugging pass is reconstructing the reasoning behind a confident-but-wrong answer. The Generative Agents paper does this implicitly through its reflection tree; production frameworks like Letta and Mem0 expose it as an API.
The discipline that separates a useful provenance chain from a heavyweight audit fixture is bounded depth. A reflection chain that grows linearly with session count becomes expensive to walk, expensive to render, and (worse) expensive to invalidate. The mitigation is depth-bounded provenance — each claim points back at most 2-3 layers, with deeper provenance reachable via the sleep-time consolidator but not surfaced by default. The full chain stays in the durable store for audit; the runtime walk is shallow.
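The three walks and the depth bound fit in a short sketch. The `SOURCES` mapping is a toy in-memory back-pointer table, and the function names are hypothetical; a real store would back these with indexed lookups. `walk_down` takes the depth cap as a parameter, while `walk_up` is deliberately unbounded because invalidation has to reach every dependent.

```python
from collections import defaultdict

# Back-pointers: claim id -> the ids it was derived from.
SOURCES = {
    "meta1": ["refl1", "refl2"],   # meta-reflection over two reflections
    "refl1": ["ep1", "ep2"],
    "refl2": ["ep2", "ep3"],
}

def walk_down(claim_id, max_depth=2):
    """claim -> sources, breadth-first, stopping after max_depth layers."""
    seen, frontier = [], [claim_id]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for src in SOURCES.get(node, []):
                if src not in seen:
                    seen.append(src)
                    next_frontier.append(src)
        frontier = next_frontier
    return seen

def build_reverse(sources):
    """Invert the back-pointers: source id -> claims that cite it directly."""
    reverse = defaultdict(list)
    for claim, srcs in sources.items():
        for src in srcs:
            reverse[src].append(claim)
    return reverse

def walk_up(source_id):
    """source -> every claim that depends on it, directly or transitively."""
    reverse = build_reverse(SOURCES)
    dependents, stack = [], [source_id]
    while stack:
        for dep in reverse.get(stack.pop(), []):
            if dep not in dependents:
                dependents.append(dep)
                stack.append(dep)
    return dependents
```

With the default bound, `walk_down("meta1")` reaches the episodes two layers down; with `max_depth=1` it stops at the reflections. `walk_up("ep2")` returns both reflections and the meta-reflection, which is the set a contradiction on `ep2` would have to revalidate.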
Code: Python — as-of retrieval with provenance walk
The smallest interesting build: a retrieval pass that takes a query plus a reference time, filters by bi-temporal validity, computes the four-signal score (recency + importance + similarity + staleness/freshness), and returns each hit with its provenance chain attached. Substrate: Chroma for vectors, a sidecar SQLite table for the validity intervals and provenance edges (because Chroma’s metadata filtering doesn’t support range queries on multiple keys efficiently). Install: pip install chromadb.
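What follows is a stand-in sketch of the pass just described, written so it runs on the standard library alone. It is not the Chroma-backed build: toy two-dimensional vectors and a plain cosine replace the vector store, min-max normalization is skipped because the toy signals already sit in [0, 1], and every name (`add_fact`, `supersede`, `retrieve`, the schema) is an assumption made for illustration. One limitation to flag: the system-time check only hides facts learned after `system_time`; reconstructing what an already-stamped `valid_to` looked like in the past would need versioned rows.

```python
import math
import sqlite3
from datetime import datetime

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE facts (id TEXT PRIMARY KEY, text TEXT, category TEXT, importance REAL,
                    valid_from TEXT, valid_to TEXT, tx_time TEXT, superseded_by TEXT);
CREATE TABLE provenance (claim_id TEXT, episode_id TEXT);
""")
HALF_LIFE_DAYS = {"forecast": 90.0}   # illustrative, not measured
EMBED = {}                            # fact id -> toy vector (stands in for the vector store)

def add_fact(fid, text, category, importance, valid_from, tx_time, vec, sources):
    db.execute("INSERT INTO facts VALUES (?,?,?,?,?,NULL,?,NULL)",
               (fid, text, category, importance, valid_from, tx_time))
    db.executemany("INSERT INTO provenance VALUES (?,?)", [(fid, s) for s in sources])
    EMBED[fid] = vec

def supersede(old_id, new_id, at):
    # Soft delete: stamp valid_to and point at the replacement; never remove the row.
    db.execute("UPDATE facts SET valid_to=?, superseded_by=? WHERE id=?", (at, new_id, old_id))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def days_between(start, end):
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).days

def retrieve(query_vec, business_time, system_time, k=3, weights=(1.0, 1.0, 1.0)):
    a, b, g = weights
    scored = []
    rows = db.execute("SELECT id, text, category, importance, valid_from, valid_to, tx_time FROM facts")
    for fid, text, cat, imp, vf, vt, tx in rows:
        # Business-time gate (ISO date strings compare correctly as text).
        if not (vf <= business_time and (vt is None or business_time < vt)):
            continue
        # System-time gate: skip facts the system had not yet learned.
        if tx > system_time:
            continue
        age = days_between(vf, business_time)
        recency = 0.5 ** (age / 30.0)
        freshness = 0.5 ** (age / HALF_LIFE_DAYS.get(cat, 365.0))
        score = freshness * (a * recency + b * imp + g * cosine(query_vec, EMBED[fid]))
        scored.append((score, fid, text))
    top = sorted(scored, reverse=True)[:k]
    # Provenance is attached last, and only for the top-k hits.
    return [{"id": fid, "text": text, "score": s,
             "sources": [r[0] for r in db.execute(
                 "SELECT episode_id FROM provenance WHERE claim_id=? ORDER BY episode_id", (fid,))]}
            for s, fid, text in top]

add_fact("f1", "Acme: v3 forecast, close end of Q2", "forecast", 0.8,
         "2026-01-10", "2026-01-10", [1.0, 0.0], ["ep1", "ep2"])
add_fact("f2", "Acme: pivoted, close date moved out", "forecast", 0.8,
         "2026-04-10", "2026-04-12", [0.9, 0.1], ["ep7"])
supersede("f1", "f2", "2026-04-10")

now_hits = retrieve([1.0, 0.0], "2026-06-02", "2026-06-02")     # only the post-pivot fact
march_hits = retrieve([1.0, 0.0], "2026-03-15", "2026-06-02")   # only the pre-pivot fact
```

The Acme scenario from the opening plays out as intended: the present-tense query never sees the superseded Q1 position, while the March query still can, and each hit carries the episode IDs it was derived from.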
Five things to notice. First, the staleness gate is a continue in the candidate loop, not a down-weight — a fact whose validity doesn’t cover the reference time is excluded entirely, the right shape for the “stale fact is worse than no fact” discipline. Second, the bi-temporal filter is two checks, not one — valid_from ≤ ref ≤ valid_to is the business-time filter; transaction_time ≤ system_time is the system-time filter; both are required to answer historical-belief queries correctly. Third, the freshness signal is multiplicative, not additive — a partially-stale fact (category half-life starting to bite) is down-weighted, but the binary gate has already excluded the explicitly-stale facts. Fourth, provenance is attached at the end, not the start — the candidates are scored against business signals; the provenance walk only runs over the top-N, which keeps the cost bounded. Fifth, the supersede helper is the contradiction primitive — it doesn’t delete the old fact; it stamps its valid_to and writes a back-pointer to the new fact via superseded_by, exactly the soft-delete pattern.
Code: TypeScript — same shape against LangGraph stores
The structural shape is identical: bi-temporal hard gate, category-conditional soft freshness signal, min-max normalize-and-blend on the three classical signals, multiplicative freshness factor at the end, depth-1 provenance walk over the top-N. The TypeScript port uses a sidecar SQLite (via better-sqlite3) for the same reasons as the Python — vector-store metadata filters are typically too limited for range queries on multiple keys, and a relational sidecar gets the indexing right.
Trade-offs, failure modes, and gotchas
The clock-skew problem. Distributed agents writing to the same memory store with locally-different wall clocks produce out-of-order valid_from and transaction_time values that confuse every temporal query. The mitigation is the same as in distributed databases: use a single authoritative timestamp source for writes (NTP-synced or, better, a Hybrid Logical Clock) rather than trusting the client’s wall clock. Most production memory frameworks default to “server timestamp at write” for this reason; the trap is migrating from a single-tenant prototype (where client wall-clock is fine) to a multi-tenant production deployment without fixing the clock model.
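For readers who have not met a Hybrid Logical Clock, here is a compact sketch following the standard update rules (Kulkarni et al.): timestamps are `(l, c)` pairs, where `l` tracks the highest physical time seen and `c` breaks ties. The class and method names are illustrative, and the physical clock is injectable so the behaviour under a lagging wall clock can be shown deterministically.

```python
import time

class HybridLogicalClock:
    def __init__(self, physical=None):
        # physical() returns wall-clock time as an integer (milliseconds by default).
        self.physical = physical or (lambda: time.time_ns() // 1_000_000)
        self.l = 0   # highest physical time observed so far
        self.c = 0   # logical counter for events sharing the same l

    def now(self):
        """Timestamp a local write or an outgoing message."""
        pt = self.physical()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def observe(self, remote):
        """Merge a timestamp received from another writer."""
        pt = self.physical()
        remote_l, remote_c = remote
        new_l = max(self.l, remote_l, pt)
        if new_l == self.l and new_l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)

# A writer whose wall clock is stuck at 100 ms, receiving from a peer far ahead.
clock = HybridLogicalClock(physical=lambda: 100)
t1 = clock.now()
t2 = clock.now()
t3 = clock.observe((250, 4))
t4 = clock.now()
```

Even though the local wall clock never advances, the four timestamps are strictly increasing as tuples, so writes stamped with them sort in causal order.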
The category-half-life calibration trap. Setting category half-lives by introspection (“how often does someone change jobs? 4 years sounds right”) is the default and is wrong as often as it’s right. The defensible move is to measure — for each category, observe the empirical distribution of valid_to - valid_from intervals in your store, fit a half-life to it. Most projects skip this and ship the introspected values; the failure mode is silent (good-feeling answers, occasional confidently-stale citations) until someone runs an audit.
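A first cut at the measurement can be very small. The sketch below assumes rows of `(category, valid_from, valid_to)` and uses the median observed lifetime as the half-life estimate, since the half-life is by definition the age by which half the facts have expired. The function name is hypothetical, and the estimate is biased low: facts that are still open (null `valid_to`) are skipped, and those are disproportionately the long-lived ones. A survival-analysis estimator would handle that censoring properly.

```python
from collections import defaultdict
from datetime import date
from statistics import median

def fit_half_lives(rows):
    """rows: iterable of (category, valid_from, valid_to); valid_to may be None."""
    lifetimes = defaultdict(list)
    for category, valid_from, valid_to in rows:
        if valid_to is not None:   # still-open facts are censored; skipped in this sketch
            lifetimes[category].append((valid_to - valid_from).days)
    # Median lifetime = the age by which half of the closed facts had expired.
    return {category: median(days) for category, days in lifetimes.items()}

rows = [
    ("job_title", date(2024, 1, 1), date(2024, 4, 10)),   # 100 days
    ("job_title", date(2024, 1, 1), date(2024, 10, 27)),  # 300 days
    ("job_title", date(2024, 1, 1), date(2025, 5, 15)),   # 500 days
    ("job_title", date(2025, 1, 1), None),                # still current, ignored
    ("address", date(2023, 1, 1), date(2023, 1, 31)),     # 30 days
]
half_lives = fit_half_lives(rows)
```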
The contradiction-detection false-negative. The contradiction-detection step (“does this new episode supersede an existing fact?”) is an LLM call with non-zero error rate. False negatives mean two contradictory facts coexist in the store with overlapping validity windows, and the retrieval rerank picks whichever has higher cosine similarity. The mitigation is a periodic sleep-time pass that runs an exhaustive contradiction check across the store, not just per-write. The Mem0 paper documents this as a background “consolidation” job; production deployments that skip it accumulate contradictions linearly with use.
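The exhaustive pass needs a cheap way to find candidate pairs before any model call. The sketch below swaps the LLM judgment for a purely structural test, which is an assumption of this example: two facts conflict if they share a normalized `key`, carry different values, and have overlapping validity windows. The field names and the `find_conflicts` helper are hypothetical; in practice the pairs it returns would be handed to the resolver rather than trusted outright.

```python
from itertools import combinations

def overlaps(a_from, a_to, b_from, b_to):
    """Half-open [from, to) intervals over ISO date strings; None means open-ended."""
    a_reaches_b = a_to is None or b_from < a_to
    b_reaches_a = b_to is None or a_from < b_to
    return a_reaches_b and b_reaches_a

def find_conflicts(facts):
    """Pairs of fact ids with the same key, different values, and overlapping validity."""
    by_key = {}
    for fact in facts:
        by_key.setdefault(fact["key"], []).append(fact)
    conflicts = []
    for group in by_key.values():
        for a, b in combinations(group, 2):
            if a["value"] != b["value"] and overlaps(
                    a["valid_from"], a["valid_to"], b["valid_from"], b["valid_to"]):
                conflicts.append((a["id"], b["id"]))
    return conflicts

facts = [
    # Missed contradiction: both windows are still open.
    {"id": "f1", "key": "user.database", "value": "Postgres",
     "valid_from": "2025-01-01", "valid_to": None},
    {"id": "f2", "key": "user.database", "value": "Cassandra",
     "valid_from": "2026-02-01", "valid_to": None},
    # Correctly superseded: the windows meet but do not overlap.
    {"id": "f3", "key": "user.employer", "value": "Acme",
     "valid_from": "2024-01-01", "valid_to": "2025-06-01"},
    {"id": "f4", "key": "user.employer", "value": "Beta Corp",
     "valid_from": "2025-06-01", "valid_to": None},
]
conflicts = find_conflicts(facts)
```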
The “we never wrote valid_to” retrofit problem. A vector-only memory store retrofitted to support as-of queries has every existing row with valid_from = ingest_time and valid_to = null. The category-conditional freshness signal still works (it’s purely a function of age and category), but the explicit-validity gate doesn’t gate anything because no row has a non-null valid_to. The mitigation is a one-time backfill: for each row, infer valid_to from the closest superseding episode (if any) using an LLM pass. The backfill is expensive (one model call per row) but only runs once; without it, the retrofit silently degrades to “soft freshness only,” which is better than nothing but worse than the bi-temporal contract the new code paths assume.
The provenance-chain depth-blowup. A reflection cites three episodes; a meta-reflection cites three reflections (nine episodes transitively); a meta-meta-reflection cites three meta-reflections (twenty-seven transitively). Without depth bounds, the walk grows exponentially and the “explain why” query becomes unanswerable. The mitigation is depth-bounded provenance (cap at 2-3 layers in the runtime walk) plus summary nodes — the sleep-time consolidator can write a flattened summary of a deep chain into a single provenance pointer, and the runtime walk hits the summary first. Generative Agents handles this implicitly via the importance threshold gating reflection-of-reflections; production frameworks have to make the bound explicit.
The reference-time-leakage bug. A retrieval that takes as_of_business=march_15 but accidentally passes as_of_system=now returns facts that were “true in March according to what we know now” — which is the correct current-belief view but the wrong historical-belief view if the user wanted “what did we think on March 15?” The bug is silent — the answers look reasonable — and only an audit query catches it. The mitigation is explicit parameter naming (call them business_time and system_time, not as_of_a and as_of_b) and a default that requires both to be passed when either is non-now.
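The "require both when either is non-now" default is easy to enforce at the API boundary. A minimal sketch, assuming keyword-only parameters named as the text suggests; `resolve_clocks` is a hypothetical helper, and the `now` argument exists only to make the behaviour testable.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

def resolve_clocks(*, business_time: Optional[datetime] = None,
                   system_time: Optional[datetime] = None,
                   now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Return (business_time, system_time), refusing half-specified queries."""
    now = now or datetime.now(timezone.utc)
    if (business_time is None) != (system_time is None):
        # Exactly one clock was given: the caller must say which view they want.
        raise ValueError("pass both business_time and system_time, or neither")
    if business_time is None:
        return now, now          # plain "as of now" query
    return business_time, system_time
```

Because the parameters are keyword-only, a call site has to spell out `business_time=` and `system_time=`, which makes a swapped or missing clock visible in code review instead of in an audit.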
The “the user is the source of truth, except when they aren’t” problem. User-asserted facts are typically given high importance and a permissive validity window, but users mis-remember and mis-state things. A fact the user just told the agent that contradicts a fact from a year ago is usually the new ground truth, but sometimes the user is wrong and the year-old fact (with corroborating provenance from multiple sources) is right. The mitigation is a user-vs-system arbitration policy — a small classifier that decides whether to trust the new user assertion outright or to flag the contradiction for explicit confirmation (“Just to confirm — last we spoke you said you used Postgres; you mentioned Cassandra just now. Did you switch, or am I misremembering?”). The classifier is a quality-vs-friction trade-off; most production agents err on the side of friction for high-stakes categories (medical, financial, legal) and silently update for low-stakes ones (food preferences, casual mentions).
The temporal-classifier’s brittleness on relative dates. “Last week” in the wrong time zone, “tomorrow” parsed as a literal string, “three months ago” when the user means “three business months ago”: the relative-time classifier has a long tail of edge cases. The mitigation is to render the resolved date back to the user on the first temporal query in a session (“I’m interpreting ‘last week’ as the week of March 8 through March 14. Is that right?”) and cache the resolution for the rest of the session. Without it, a misinterpreted relative date produces a confidently-wrong answer with no visible failure signal. Published evaluations such as the Test of Time benchmark (Fatemi et al., 2024) suggest that even frontier models lose a substantial share of their accuracy, on the order of a quarter to a third, when relative dates are involved; the classifier deserves explicit treatment, not the default LLM fallback.
The provenance-attribution noise. A reflection that cites four episodes “because the LLM said so” can be over-confident in the attribution — the reflection-generation prompt asked which episodes supported the claim and the model answered, but its self-reported attribution doesn’t always match the actual cosine-closest sources. The mitigation is evidence verification: when a reflection lands, re-embed it and confirm that each cited source has cosine similarity above a threshold, and prune sources that fail. Without this, the chain accumulates spurious attributions that don’t add audit value.
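The verification step reduces to a threshold filter over cosine similarities. In this sketch the embeddings are toy two-dimensional vectors and the 0.5 threshold is an arbitrary placeholder; both are assumptions, and a real threshold would need tuning against the embedding model in use.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def prune_sources(claim_vec, source_vecs, threshold=0.5):
    """Split cited sources into (kept, pruned) by similarity to the claim embedding."""
    kept, pruned = [], []
    for source_id, vec in source_vecs.items():
        if cosine(claim_vec, vec) >= threshold:
            kept.append(source_id)
        else:
            pruned.append(source_id)
    return kept, pruned

# The model cited three episodes for "the user is vegetarian"; one is unrelated.
claim_vec = [1.0, 0.0]
cited = {
    "ep1": [0.9, 0.1],   # close to the claim
    "ep2": [0.7, 0.7],   # loosely related, still above the threshold
    "ep3": [0.0, 1.0],   # orthogonal: a spurious attribution
}
kept, pruned = prune_sources(claim_vec, cited)
```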
The “as of now” gate that’s too aggressive. A staleness gate that excludes everything with valid_to < now works correctly until the agent is asked a present-tense question about a fact whose valid_to was just stamped. The right answer is often still “yes, that’s still the case” — the stamp may have been a contradiction-detection false positive, or the user may be asking about a state that’s been recently superseded but not yet acknowledged. The mitigation is a grace period: facts whose valid_to is within the last N days remain retrievable but with a heavy freshness penalty (e.g., freshness *= 0.3), so the rerank can still surface them if the user explicitly references the relevant time. Pure binary gates are a sharp edge that catches users on the wrong side of recent invalidations.
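The grace period turns the binary gate into a three-way one. A minimal sketch, with a 14-day window and the 0.3 penalty from the text as defaults; the function name and the half-open interval convention are assumptions of this example.

```python
from datetime import date, timedelta
from typing import Optional

def validity_multiplier(valid_from: date, valid_to: Optional[date], ref: date,
                        grace_days: int = 14, penalty: float = 0.3) -> float:
    """1.0 inside the validity window, `penalty` just after it closes, 0.0 otherwise."""
    if ref < valid_from:
        return 0.0                                   # not yet true at the reference time
    if valid_to is None or ref < valid_to:
        return 1.0                                   # inside the window
    if ref - valid_to <= timedelta(days=grace_days):
        return penalty                               # recently invalidated: keep, but penalize
    return 0.0                                       # long expired: hard gate

start, end = date(2026, 1, 10), date(2026, 4, 10)
inside = validity_multiplier(start, end, date(2026, 3, 1))
just_after = validity_multiplier(start, end, date(2026, 4, 15))
long_after = validity_multiplier(start, end, date(2026, 6, 2))
```

The multiplier slots into the scoring formula in place of the binary gate, so a recently invalidated fact can still surface when the other signals are strong.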
The cross-tenant provenance leak. A reflection that cites episodes is correctly scoped to the user via the where clause on the read path, but the provenance walk — a follow-up query for “what episodes is this reflection based on?” — has to enforce the same tenant scope or it can return source IDs from other users. The mitigation is to scope provenance lookups by user explicitly (WHERE claim_id = ? AND user = ?) rather than by claim ID alone, and to audit every site in the read path that walks the provenance graph. The bug is rare but high-severity when it lands.
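The scoped lookup is a one-clause difference, which is why it is easy to miss. The sketch below uses an in-memory SQLite table with an assumed `user_id` column and a hypothetical `sources_for` helper, and deliberately gives two tenants the same claim ID to show what an unscoped query would leak.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE provenance (claim_id TEXT, episode_id TEXT, user_id TEXT)")
db.executemany(
    "INSERT INTO provenance VALUES (?, ?, ?)",
    [
        ("refl1", "ep1", "alice"),
        ("refl1", "ep2", "alice"),
        ("refl1", "ep9", "bob"),    # same claim id, different tenant
    ],
)

def sources_for(claim_id, user_id):
    """Provenance walk-down scoped by tenant as well as by claim."""
    rows = db.execute(
        "SELECT episode_id FROM provenance "
        "WHERE claim_id = ? AND user_id = ? ORDER BY episode_id",
        (claim_id, user_id),
    )
    return [row[0] for row in rows]

def sources_for_unscoped(claim_id):
    """The buggy shape: filters by claim id alone and crosses tenants."""
    rows = db.execute(
        "SELECT episode_id FROM provenance WHERE claim_id = ? ORDER BY episode_id",
        (claim_id,),
    )
    return [row[0] for row in rows]

alice_sources = sources_for("refl1", "alice")
leaked = sources_for_unscoped("refl1")
```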
Further reading
- Zep: A Temporal Knowledge Graph Architecture for Agent Memory — Rasmussen et al., 2025 — the canonical reference for the bi-temporal graph model as applied to agent memory. The §3 architecture description has the cleanest available formalization of the two-clock model (valid time + transaction time), edge invalidation, and the temporally-filtered subgraph retrieval pattern. The §4 benchmark section reports 18-19% accuracy lifts over single-temporal baselines on Deep Memory Retrieval and LongMemEval, with the largest gains concentrated in temporal-reasoning categories.
- Language Models Struggle to Achieve a Consistent Temporal Representation of Facts — Bajpai et al., 2025 — the cleanest published evidence that LLMs have surface-level temporal awareness but lack robust consistency. Reports an empirical accuracy gap of 23-35% between explicit-absolute and explicit-relative phrasings of the same temporal query, and a global robustness score (across paraphrases) below 7% even for frontier models. The paper is the strongest argument that the temporal-intent classifier deserves explicit code, not a “the LLM will figure it out” fallback.
- A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation — ChronoQA, 2025 — the published temporal-RAG benchmark covering 5,176 time-sensitive questions across absolute, relative, and aggregate temporal types. The methodology section is the closest available reference for designing a temporal eval against your own memory store; the empirical findings are a useful sanity check on what current systems get right and wrong.
- Snodgrass — Developing Time-Oriented Database Applications in SQL, 1999 — the foundational text on temporal databases, available in full as a free PDF. Chapters 4-6 work through the bitemporal model in detail; the agent-memory port loses essentially nothing in translation, and reading the SQL formulation makes the corresponding agent-memory mechanics far more legible than any LLM-flavored writeup alone.
What to read next
- Memory Retrieval Policies: Recency, Relevance, Importance — the read-side rerank this article builds on. The Generative Agents formula and the cache-replacement frame are the substrate; temporal-aware retrieval is the extension that adds the as-of dimension and the freshness factor.
- Knowledge Graphs as Structured Memory: the write-side companion, covering the bi-temporal graph data model whose valid_time/transaction_time columns this article’s read path operates against. The two clocks are introduced there; the queries that use them are the focus here.
- Sleep-Time Compute and Memory Consolidation: the background pass that runs exhaustive contradiction detection across the store and writes flattened provenance summaries to keep the runtime walk shallow. The temporal-reasoning failures this article enumerates (contradiction false-negatives, provenance depth-blowup) are mostly mitigated by a well-designed consolidator.
- Memory Conflict, Forgetting, and Embedding Drift: the natural successor, where the bi-temporal substrate this article builds carries the contradiction resolver, the active-forgetting eviction queue, and the embedding-drift migration patterns. The same valid_to, last_verified, and superseded_by fields turn into a load-bearing layer for keeping a long-running store internally consistent.