$ cat ai-engineering/memory-conflict-and-forgetting.md

Memory Conflict, Forgetting, and Embedding Drift

Three failures of agent memory at scale: contradiction handling, active forgetting, and embedding drift — with worked patterns and code.

Jatin Bansal@blog:~/ai-engineering$ open memory-conflict-and-forgetting

Three things go wrong with a memory store once it has been accepting writes for long enough. First, two facts in the store disagree: the agent learned in March that the user’s stack was Postgres, learned in April that the user moved to Cassandra, and both writes survived. Second, the store keeps growing without compensating decay: every write that was ever interesting stays interesting at retrieval time, the candidate set grows past what reranking can stay precise on, and the average answer gets duller week by week. Third, somebody upgrades the embedding model (text-embedding-3-small becomes text-embedding-4-small, or voyage-3 jumps to voyage-3-large), and every vector in the store is now in a coordinate system the queries no longer reach. Each failure is its own discipline with its own canonical patterns, and each is the kind of thing a system can have for months without anyone noticing, until the agent confidently cites stale information, retrieval recall erodes week over week, or the model upgrade lands and recall collapses overnight. This article works through all three.

Opening bridge

Yesterday’s piece closed the temporal-reasoning subtree: bi-temporal validity, as-of queries, staleness gates, provenance back-pointers. That article handled the cases where the data itself tells you the answer — valid_to is in the past, so the fact is stale; transaction_time is after the query, so the system didn’t know yet. Today’s article handles the cases where the data doesn’t tell you: two facts coexist with overlapping validity windows and the rerank has no principled way to choose; the store is full of “still-valid” facts that just aren’t useful anymore; or the embedding model itself changed and the vector representation of “still-valid” is now in a different geometry than the query. The three failures are sibling diseases of the same organ — every long-running memory store eventually gets all three — and the bi-temporal substrate from yesterday is the platform they sit on. The contradiction handler reads valid_to. The forgetting policy reads last_verified and last_accessed. The drift mitigation reads the model version stamped at write time. Without yesterday’s substrate, none of today’s mitigations have anywhere to land.

Definition

Memory conflict resolution is the discipline of detecting when two facts in the store disagree and choosing how to surface the resolution to the read path. Three operational modes. Detect on write — when a new fact lands, scan the store for facts that overlap its category, scope, and validity window; if any contradict, fire the resolver. Detect on consolidation — a background pass over the store finds pairs of facts whose embeddings cluster above a similarity threshold and whose claims disagree on inspection; resolve and write the merged outcome. Detect on read — when retrieval surfaces two candidates that contradict, the rerank either picks one with a recorded resolution rule, returns both with explicit divergence flags, or asks the user. Production systems combine all three: write-time catches the obvious cases cheaply, consolidation catches what write-time missed, and the read-time check is the safety net for the pairs neither earlier pass resolved.
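A minimal sketch of the candidate scan behind detect-on-write, assuming an in-memory list of facts. The `Fact` dataclass, its `scope` field, and `conflict_candidates` are illustrative names rather than any library’s API. The scan only narrows the field to facts that share category and scope and whose validity windows overlap; deciding whether two such facts actually contradict is still the resolver’s job.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    id: str
    category: str
    scope: str                        # e.g. the user or project the fact is about
    text: str
    valid_from: float
    valid_to: Optional[float] = None  # None means "still valid"


def windows_overlap(a: Fact, b: Fact) -> bool:
    """True when the two half-open validity intervals share any instant."""
    a_end = float("inf") if a.valid_to is None else a.valid_to
    b_end = float("inf") if b.valid_to is None else b.valid_to
    return a.valid_from < b_end and b.valid_from < a_end


def conflict_candidates(new: Fact, store: list[Fact]) -> list[Fact]:
    """Facts the resolver should inspect before `new` lands in the store."""
    return [
        f for f in store
        if f.id != new.id
        and f.category == new.category
        and f.scope == new.scope
        and windows_overlap(f, new)
    ]
```

A fact whose `valid_to` already closed before the new fact’s `valid_from` never reaches the resolver, which is what keeps the write-time check cheap.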

Forgetting is the discipline of actively removing or down-weighting memories that the system no longer benefits from carrying. Forgetting is not the same as unlearning. Forgetting is a soft-state property — the agent doesn’t retrieve the memory because the policy has decayed it below a usefulness threshold, but the underlying data is still in the store and could in principle be recovered. Unlearning is a hard property — the underlying data is gone, in a way that satisfies a verifiable deletion guarantee (the GDPR Article 17 case). Most production discussion of “forgetting” is the soft kind; the hard kind needs its own deletion pipeline, audit trail, and verification step, and getting it right is a separate engineering exercise from designing a decay policy.

Embedding drift is the cluster of failure modes that appears when the embedding model underneath a vector store changes versions. Five sub-failures, each with its own mitigation. Coordinate-system mismatch: a v1 vector and a v2 vector are not comparable; even for the same text, their cosine similarity is unpredictable. Recall collapse, the most acute symptom: a v2 query against a v1 index returns near-random nearest neighbors. Hybrid skew: partial migration leaves the index with mixed v1 and v2 vectors, and reranking sees a coordinate-system mixture rather than a semantic ordering. Cost blast: full re-embedding of a large store is expensive in both compute and operational risk. Index rebuild downtime: re-embedding invalidates the ANN index, which has to be rebuilt against the new vectors and brought online without dropping production traffic. The mitigation menu (dual-index migration, alias swaps, Drift-Adapter affine maps) exists because no single approach wins against all five.

Intuition

The mental model that ties the three together is every write is a temporal claim against a coordinate system, and both of those can change underneath it. A fact written in March 2026 was true in March, was true because the embedding model encoded “Postgres” near “relational database” and far from “Cassandra,” and was true given the system’s belief state at the time. Three months later, the fact may no longer be true (the user switched stacks — contradiction), may no longer be useful (the user hasn’t mentioned it since — eviction candidate), and may no longer be findable (the embedding model was upgraded and the vector for “Postgres user” doesn’t cosine-match the v2 vector for “what’s the user’s database?”). The substrate is the same: a fact is durable in its claim and its representation, and a long-lived store has to handle both decaying out from under it.

Three intuitions worth holding while reading the rest.

Conflict resolution is a write-amplification problem in disguise. Every contradiction the agent detects on the read path is a contradiction the write path didn’t catch — and every contradiction the read path quotes is a reflection or summary the write path generated against stale source data. The cheapest place to handle a contradiction is at the moment of the second write, when the agent has both facts in working memory and can call an LLM resolver before either lands in the durable store. The most expensive place is at retrieval time, where every read pays the rerank cost of carrying both contradictory facts through the candidate set. Push the work upstream.

Forgetting is the read-path’s mirror of the write policy. The memory write policy decides what enters the store; the forgetting policy decides what leaves. Both are scoring functions over the same fact metadata (importance, category, last access, last verification), and both can be tuned against the same memory evals — the contradiction-resolution category on MemoryAgentBench is the most direct measurement of whether the resolver in this article is actually working. The asymmetry: write decisions are made with full agent context, forgetting decisions are made in the background against a snapshot. Forgetting has to be conservative in a way write policy doesn’t.

Embedding drift is the version-skew problem from distributed systems, applied to representations. The same data lives in two coordinate systems that are not interconvertible without information loss. The patterns are the patterns from any distributed-system migration: dual-write the new version, backfill the old, validate, switch reads, decommission. The novelty isn’t the shape of the migration — it’s that the cost ratio is different (re-embedding a billion vectors is a real bill) and the validation is fuzzier (recall metrics, not row counts) than in a classical database migration.

The distributed-systems parallel — soft-delete, LRU eviction, and live schema migrations

Three parallels, each load-bearing.

Soft-delete with audit trail is the canonical contradiction-resolution pattern. In any event-sourced database — Datomic, XTDB, Postgres-temporal with range types, Cassandra with TTL-stamped tombstones — the way contradictions are handled is the same shape: don’t delete the old row, mark it superseded and stamp the new row with a back-pointer. The agent-memory port is identical. Mem0’s DELETE operation doesn’t physically remove the contradicted memory; it marks the row as INVALID and keeps it in the store for temporal-reasoning queries. The pattern’s value is precisely what it is in the database literature — auditability (“why did the agent believe X on March 15?”), recoverability (the supersession may itself be wrong, in which case the rollback is just an UPDATE of the validity stamp), and the ability to answer historical-belief queries that need the system’s belief state at a past time. The cost is storage growth: contradictions never delete data, they just stack it. Production deployments accept the storage cost and lean on the background consolidator to compress dead branches into summary nodes.

Cache eviction algorithms are the direct analogue for active forgetting. The memory-retrieval-policies article made the cache-replacement parallel explicit on the read path; the forgetting policy is the same pattern on the storage side. Pure-LRU (last-access wins) over-evicts important-but-rarely-accessed facts. Pure-LFU (frequency wins) over-evicts new facts that haven’t had time to accumulate hits. Importance-weighted policies (a fact’s score is α·log(access_count) + β·importance + γ·exp(-Δt/half_life)) are the agent-memory analogue of ARC, which adapts its LRU/LFU mix to the workload. The agent-memory variant adds two terms ARC doesn’t have: a category half-life (a name fact decays much slower than an event fact), and a verification recency boost (a fact the user just re-confirmed gets bumped, regardless of how rarely it’s been accessed). The shape — a learned eviction queue over per-fact scores — is the FadeMem approach, which reports a 45% storage reduction over Mem0 at comparable benchmark accuracy.

Live schema migrations are the canonical pattern for embedding drift. A relational schema change with millions of dependent rows is the classic case of “you can’t take the system down, you can’t lose any writes, and the new shape has to be validated before reads switch over.” The five-stage migration is canonical: (1) start dual-writing to both old and new representations; (2) backfill the new representation from the existing store; (3) validate that read paths against the new representation return semantically equivalent results; (4) switch reads to the new representation; (5) stop writing the old, then drop it after a grace period. The agent-memory port is identical, with one substitution: “schema” becomes “embedding model.” The dual-write phase encodes every new write with both v1 and v2; the backfill re-encodes the v1 store with v2; the validation phase shadow-queries the v2 index against the v1 index and compares recall (NDCG@10, Recall@10 against a held-out query set); the read switch is an alias swap on the index name; the decommission phase drops the v1 index after the agreed grace window. The novelty is the validation — in a classical migration, the validation is “row counts match”; in an embedding migration, it’s a quality metric on retrieval against a held-out set. Drift-Adapter collapses stages 2-4 by training a small affine map (or low-rank residual MLP) that translates v2 queries into the v1 space at query time, recovering 95-99% of the recall a full re-embedding would have produced — at <10µs of query latency and ~100× less compute than the dual-index path.

Contradiction handling — the four-state resolver

The cleanest formalization of write-time conflict resolution is the Mem0 ADD/UPDATE/DELETE/NOOP classifier. For every candidate write, an LLM-driven resolver compares the candidate against the closest existing facts and decides:

ADD — no semantically equivalent fact exists. Write the candidate as a new entry. The default case.
UPDATE — an existing fact is closely related and the new fact augments it (an additional detail, a refined version, a constraint added). Merge into the existing entry; do not write a new one.
DELETE — an existing fact is contradicted by the new one. Mark the old as INVALID (do not physically remove), and write the new as a separate entry with a supersedes pointer back to the old.
NOOP — the new fact is already represented in the store with no meaningful change. Discard.

The classifier is a single LLM call with the candidate plus the top-K (K ≈ 5-10) closest existing facts in the prompt; the model returns the operation and the target fact ID. Cost is bounded — one resolver call per write, against a small candidate set. The Mem0 paper reports the resolver running on every memory update and being the primary mechanism by which the store stays internally consistent.

Three failure modes worth handling explicitly.

The false-negative cascade. The resolver decides ADD when it should have decided DELETE — a real contradiction is missed because the embedding similarity between the old and new facts was below the candidate-retrieval threshold. The contradiction sits in the store until something surfaces both facts in the same retrieval and the rerank trips. The mitigation is the sleep-time consolidation pass — a periodic background job that runs the resolver across every fact pair within a category, not just the top-K-similar ones, and catches the contradictions that the per-write retrieval missed. Mem0 documents this as a consolidation job; it reports the consolidator catching 5-10% additional contradictions per pass on a steady-state store.
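The consolidation pass can be sketched as an exhaustive pairing within each category. This is a rough illustration, not Mem0’s implementation: `category_pairs`, `consolidation_pass`, and the injected `contradicts` callable (a stand-in for the LLM resolver) are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Iterator


def category_pairs(facts: list[dict]) -> Iterator[tuple[dict, dict]]:
    """Yield every unordered pair of facts that share a category."""
    by_category: dict[str, list[dict]] = defaultdict(list)
    for fact in facts:
        by_category[fact["category"]].append(fact)
    for group in by_category.values():
        yield from combinations(group, 2)


def consolidation_pass(
    facts: list[dict],
    contradicts: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Return the (id, id) pairs the checker flags as contradictory."""
    return [
        (a["id"], b["id"])
        for a, b in category_pairs(facts)
        if contradicts(a["text"], b["text"])
    ]
```

A category with n facts produces n(n-1)/2 resolver calls, which is why this belongs on a background schedule and not on the write path.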

The high-density contradiction cliff. Most workloads have contradiction density (fraction of new writes that supersede an existing fact) in the single-digit-percent range, and the resolver handles them comfortably. Some workloads — onboarding flows where the user is rapidly updating their profile, debugging conversations where the model is iterating on a hypothesis — spike into the 20-30% range, and the resolver’s load doubles and its accuracy drops (more candidates means more chances of confusion in the classification prompt). Mem0’s benchmark numbers show the resolver staying above threshold up to 30% contradiction density with timestamp-aware features turned on; beyond that, the policy needs to either widen the candidate window (more LLM cost) or accept some contradictions in the store and rely on the consolidator. The pattern is what queueing theory calls “the saturation knee” — the operating curve is fine up to a load and then degrades sharply.

The user-vs-system arbitration problem. A new user assertion contradicts an existing system-derived belief. The user is usually right — they’re the source of truth on their own preferences, history, and intent. But sometimes they mis-remember or mis-state, and the year-old belief (with corroborating provenance from multiple sources) is the more trustworthy one. The arbitration policy is a classifier on top of the resolver: for high-stakes categories (medical, financial, legal), confirm explicitly before overriding; for low-stakes categories (food preferences, casual mentions), update silently. The friction trade-off is workload-specific; production agents that don’t make the choice explicitly default to “always trust the user” and accumulate quiet errors when the user is wrong.
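Making the arbitration choice explicit can be as small as a lookup in front of the resolver’s decision. A sketch under stated assumptions: the category names come from the paragraph above, while `arbitrate` and its return values `"confirm"` and `"apply"` are hypothetical.

```python
HIGH_STAKES_CATEGORIES = frozenset({"medical", "financial", "legal"})


def arbitrate(op: str, category: str) -> str:
    """Decide whether a resolver decision needs explicit user confirmation.

    Returns "confirm" to ask the user first, or "apply" to proceed silently.
    """
    if op != "DELETE":
        # Only an override of an existing belief needs arbitration.
        return "apply"
    return "confirm" if category in HIGH_STAKES_CATEGORIES else "apply"
```

The point is less the lookup than having one place where the trust policy is written down and testable, so the default is a decision and not an accident.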

Active forgetting — biologically-inspired decay

The Ebbinghaus forgetting curve — a memory’s retention decays exponentially with time, slowed by each repetition — is the standard frame. The agent-memory port carries three knobs the cognitive-psychology version doesn’t have: per-category half-lives, an importance multiplier, and a verification boost.

Per-category half-lives. The decay rate is not global. A name fact (half_life = ∞) shouldn’t decay at all. A food-preference fact decays over a multi-year horizon. An event fact (half_life ≈ 30 days) decays fast. Setting the half-lives by introspection — “events feel like they decay over a month” — is the default and is wrong as often as right. The defensible approach is to measure — fit half-lives against the empirical distribution of valid_to - valid_from intervals in your store, conditional on category. Most projects skip the measurement step and ship the introspected values; the failure mode is silent until an audit catches it.
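One rough way to do the measurement, assuming fact lifetimes within a category are approximately exponentially distributed, in which case the median observed lifetime equals the half-life. `fit_half_lives` and `min_samples` are hypothetical names, and the rows are assumed to be `(category, valid_from, valid_to)` tuples in seconds.

```python
from collections import defaultdict
from statistics import median
from typing import Iterable, Optional


def fit_half_lives(
    rows: Iterable[tuple[str, float, Optional[float]]],
    min_samples: int = 3,
) -> dict[str, float]:
    """Estimate a per-category half-life as the median closed validity interval."""
    durations: dict[str, list[float]] = defaultdict(list)
    for category, valid_from, valid_to in rows:
        if valid_to is None:
            # Still-valid facts have no observed end yet, so they are skipped.
            continue
        durations[category].append(valid_to - valid_from)
    return {
        category: median(values)
        for category, values in durations.items()
        if len(values) >= min_samples
    }
```

Skipping still-valid facts biases the estimate low, since the longest-lived facts are exactly the ones that have not closed yet. A proper survival-analysis fit would account for that censoring; this sketch does not.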

Importance multiplier. Two facts in the same category can have very different long-run value. “The user’s job title is Staff Engineer” decays at the job-title half-life, but a high-importance variant (“the user’s promotion to staff was contingent on the Q2 launch”) deserves to decay slower. The standard form is strength(t) = importance · exp(-Δt / half_life), with importance ∈ [0, 1] set by the salience scorer at write time.

Verification boost. A fact that’s been re-confirmed by the user (the last_verified timestamp got updated) gets a strength boost — strength *= 1 + δ · indicator(recently_verified). The signal captures the behavioral component of memory: a fact the user re-mentioned last week is current regardless of when it was first ingested. The mechanic shows up in the MemoryBank read-driven strength updates, and it’s the same shape as the verification term in yesterday’s temporal-reasoning piece.

The combined eviction score:

text

score(fact) = importance · exp(-(now - last_access) / half_life[category]) · (1 + δ · recently_verified)

Facts whose score falls below a threshold (or whose total count exceeds a budget) get evicted. Eviction is not deletion in the unlearning sense — the row is moved to a cold tier, marked archived, and excluded from the default retrieval path. A subsequent explicit query can still find it. The pattern is exactly the hierarchical-memory demotion from working → episodic → cold; “forgetting” is the demotion at the storage layer.
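One detail worth pinning down before tuning the constants: with `exp(-Δt / half_life)` the constant is an e-folding time (strength falls to about 37% after one constant), whereas a literal half-life needs base 2, which is the form the `math.pow(0.5, ...)` call in the listing further down uses. A small numeric check, with hypothetical function names:

```python
import math


def strength_efold(importance: float, dt: float, constant: float) -> float:
    """Decay as written in the prose formula: importance * exp(-dt / constant)."""
    return importance * math.exp(-dt / constant)


def strength_half_life(importance: float, dt: float, half_life: float) -> float:
    """Decay with a literal half-life: strength halves every `half_life`."""
    return importance * 0.5 ** (dt / half_life)


def to_efold_constant(half_life: float) -> float:
    """Constant that makes the exp() form halve after exactly `half_life`."""
    return half_life / math.log(2)
```

Either form works as long as the table of constants is interpreted consistently; mixing them shifts every effective half-life by a factor of ln 2 ≈ 0.69.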

FadeMem ships this with a dual-layer hierarchy and reports the 45% storage reduction at comparable benchmark accuracy. The biological-fidelity argument is interesting but not load-bearing; the engineering argument is that the score is principled (every term has a measurable workload analogue), the policy is cheap (one score per fact per pass), and the failure modes are localized (a wrongly-evicted fact is recoverable from the cold tier, unlike a wrongly-deleted fact).

The hard form of forgetting — machine unlearning — is a separate engineering exercise. GDPR Article 17 erasure (“delete everything the agent knows about user X”) requires the data to be provably gone from every derived artifact: the vector store row, the reflection that cited it, the embedding cache, the prompt-caching prefix, the trace logs. The audit boundary is much wider than the soft-forgetting boundary. The 2025 machine unlearning literature covers the technique mix — retraining, targeted parameter editing, output filtering — but most production deployments handle the vector-store side with a hard delete plus a reflection-walk that prunes dependent claims, accepting some residual influence in the model weights themselves. The right discipline is to treat unlearning as a deletion pipeline, not a memory-policy tuning exercise, and to instrument it with a verification step that confirms each derived artifact actually got pruned.
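The hard delete plus reflection-walk can be sketched as a graph traversal followed by a verification step. This is an illustration under simplifying assumptions: artifacts live in one dict, `derived_from` maps each derived artifact to the source IDs it cites, and any artifact that cites an erased source is erased wholesale instead of being rewritten. `dependents_of` and `erase` are hypothetical names.

```python
from collections import deque


def dependents_of(root_ids: set[str], derived_from: dict[str, set[str]]) -> set[str]:
    """Roots plus every artifact that transitively cites one of them."""
    children: dict[str, set[str]] = {}
    for artifact, sources in derived_from.items():
        for source in sources:
            children.setdefault(source, set()).add(artifact)
    seen = set(root_ids)
    queue = deque(root_ids)
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def erase(
    root_ids: set[str],
    store: dict[str, str],
    derived_from: dict[str, set[str]],
) -> tuple[set[str], set[str]]:
    """Hard-delete roots and dependents; return (targeted, still_present)."""
    doomed = dependents_of(root_ids, derived_from)
    for artifact_id in doomed:
        store.pop(artifact_id, None)
        derived_from.pop(artifact_id, None)
    # Verification step: anything still present is a pipeline failure to surface.
    leftovers = {artifact_id for artifact_id in doomed if artifact_id in store}
    return doomed, leftovers
```

In a real deployment the same walk has to cover every derived artifact the paragraph lists (embedding cache, cached prompt prefixes, trace logs), each with its own verification check.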

Embedding drift — the migration patterns

When the embedding model under the store changes — text-embedding-3-small → text-embedding-4-small, voyage-2 → voyage-3-large, an open-source model swap — the vectors already in the store are not directly usable by the new model’s queries. Four migration patterns, in increasing order of cost-vs-recall trade-off.

The naive re-embedding. Re-encode every fact in the store with the new model. Rebuild the ANN index. Switch reads. Cost: linear in store size, O(N) model calls. Operational risk: high — the index rebuild needs to happen offline, and during the rebuild window the read path is either down or serving stale results. For a million-fact store at ~$0.00002 per embedding, the re-encoding bill is $20, but the index rebuild and the validation pass dominate operationally. Defensible for small stores; impractical at scale.
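The bill in that paragraph is a single multiplication, which makes it easy to rerun for other store sizes. A tiny sketch; the per-embedding price is the rough figure used above, not a quoted vendor rate, and `reembedding_cost` is a hypothetical helper.

```python
def reembedding_cost(n_facts: int, price_per_embedding: float) -> float:
    """Model-call cost of re-encoding every fact once, in the price's currency."""
    return n_facts * price_per_embedding


# One million facts at the paragraph's ~$0.00002 per embedding.
million_fact_bill = reembedding_cost(1_000_000, 0.00002)
# The same price at a billion vectors, where the bill stops being a rounding error.
billion_fact_bill = reembedding_cost(1_000_000_000, 0.00002)
```

At a million facts that is about $20 and at a billion about $20,000, which covers only the model calls; the index rebuild and validation pass are not in this number.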

Dual-index migration. Run v1 and v2 indexes in parallel. Encode every new write with both models, store both vectors. Backfill the v2 index from the existing v1 corpus in the background. Validate by shadow-querying both indexes and comparing recall against a held-out query set. When v2 recall ≥ v1 recall, swap the read alias from v1 to v2. Decommission v1 after a grace window. Cost: doubled write traffic and doubled storage during the migration window; one round of re-encoding. Operational risk: low — every step is rollback-safe, the read path never breaks, and the validation is principled. The pattern is standard in production vector deployments and is documented across Pinecone, Weaviate, and Qdrant operational guides. The cost is high but bounded; the safety is essentially complete.
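The shadow-validation gate can be sketched as a recall comparison over a held-out query set. A minimal version, assuming each index’s results are already collected as ranked ID lists per query and that `gold` maps each query to its set of relevant IDs; all three function names are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall(
    results: dict[str, list[str]],
    gold: dict[str, set[str]],
    k: int = 10,
) -> float:
    """Average Recall@k over every query in the held-out set."""
    return sum(recall_at_k(results[q], gold[q], k) for q in gold) / len(gold)


def ready_to_swap(
    v1_results: dict[str, list[str]],
    v2_results: dict[str, list[str]],
    gold: dict[str, set[str]],
    k: int = 10,
) -> bool:
    """Gate for the alias swap: v2 must match or beat v1 on the held-out set."""
    return mean_recall(v2_results, gold, k) >= mean_recall(v1_results, gold, k)
```

In practice the gate would likely include a tolerance and a minimum query count, since a small held-out set makes the comparison noisy.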

Alias-based versioning. A refinement on the dual-index pattern: every index is named with the model version and a date (memories_v1_20260201, memories_v2_20260415), and the application references a stable alias (memories_current) that points to whichever version is live. The alias swap is atomic; the rollback is an alias-flip. The cost is the same as dual-index; the operational ergonomics are dramatically better because every deployment that depends on the index references an alias rather than a version, and the alias is the single point of cutover.
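The alias mechanics, reduced to an in-memory stand-in. `AliasRegistry` is a hypothetical class written for illustration; it is not the alias API of any particular vector database, each of which exposes its own.

```python
class AliasRegistry:
    """Maps a stable alias to the versioned index that is currently live."""

    def __init__(self) -> None:
        self._live: dict[str, str] = {}
        self._history: dict[str, list[str]] = {}

    def point(self, alias: str, index_name: str) -> None:
        """Cut the alias over to `index_name`, remembering the previous target."""
        previous = self._live.get(alias)
        if previous is not None:
            self._history.setdefault(alias, []).append(previous)
        self._live[alias] = index_name

    def resolve(self, alias: str) -> str:
        """Index name that readers of this alias should query right now."""
        return self._live[alias]

    def rollback(self, alias: str) -> str:
        """Flip the alias back to its previous target and return that target."""
        previous = self._history[alias].pop()
        self._live[alias] = previous
        return previous
```

Application code only ever calls `resolve("memories_current")`, so the cutover and the rollback are both a single write to the registry.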

Drift-Adapter affine maps. The 2025 result that changes the cost calculus. Instead of re-embedding every vector, train a small transformation T: ℝᵈ² → ℝᵈ¹ on a sample of paired old/new embeddings: encode a few thousand documents with both models, then fit an orthogonal Procrustes map (or a low-rank affine map, or a small residual MLP) from v2-space to v1-space. At query time, the new model encodes the query into v2-space, the adapter projects it into v1-space, and retrieval runs against the existing v1 index. The Drift-Adapter paper (Vejendla, EMNLP 2025) reports the affine map recovering 95-99% of the recall a full re-embedding would have produced, on MTEB text corpora and a 1M-item CLIP image upgrade, with <10µs of added query latency and 100× less recompute than the dual-index path. The trade-off is the 1-5% recall gap: Drift-Adapter is a pragmatic migration that defers the re-embedding rather than eliminating it, and the residual gap matters for workloads where the top-K is the entire signal. For most application-side retrieval, the gap is well below other sources of noise.
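A minimal sketch of the simplest adapter variant, the orthogonal Procrustes map, using NumPy. It is not the paper’s code: `fit_procrustes` and `adapt_query` are hypothetical names, the map runs from the new model’s space into the old index’s space (the direction the query-time step needs), and it assumes both models emit vectors of the same dimension, which the orthogonal form requires.

```python
import numpy as np


def fit_procrustes(new_vecs: np.ndarray, old_vecs: np.ndarray) -> np.ndarray:
    """Orthogonal R minimizing ||new_vecs @ R - old_vecs||_F.

    Rows are paired: row i of each matrix embeds the same document,
    once with the new model and once with the old one.
    """
    if new_vecs.shape != old_vecs.shape:
        raise ValueError("orthogonal Procrustes needs paired samples of equal dimension")
    u, _, vt = np.linalg.svd(new_vecs.T @ old_vecs)
    return u @ vt


def adapt_query(query_new: np.ndarray, rotation: np.ndarray) -> np.ndarray:
    """Project a new-model query vector into the old index's space."""
    return query_new @ rotation
```

When the two models differ in dimension, or the relationship is not close to a rotation, this form does not apply and one of the affine or MLP variants would be needed instead.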

The choice between dual-index and Drift-Adapter is a workload question, not an architecture question. If recall@10 is the dominant metric and the 1-5% gap is visible in evals, run the full dual-index. If query latency and migration cost dominate, the adapter is the better path. Production deployments that have done both report using the adapter as a bridge — deploy it on the day of the model upgrade to keep recall flat, then run the dual-index migration in the background over the following weeks, and decommission the adapter when the v2 index is fully populated and validated.

Code: Python — conflict-resolving writer with active forgetting

The combined write path: a new candidate fact arrives, the resolver classifies the operation (ADD/UPDATE/DELETE/NOOP), and the active-forgetting pass runs against the store to evict cold facts after the write lands. Substrate: Chroma for vectors, the same SQLite sidecar pattern from yesterday’s piece for the metadata and the validity intervals, and the Anthropic SDK for the resolver call. Install: pip install chromadb anthropic.

python

# pip install chromadb anthropic
import json
import math
import os
import sqlite3
import time
import uuid

import chromadb
from anthropic import Anthropic

client = Anthropic()  # ANTHROPIC_API_KEY in env
chroma = chromadb.PersistentClient(path="./memory_store")
facts = chroma.get_or_create_collection("facts")

meta = sqlite3.connect("memory_meta.db")
meta.executescript("""
CREATE TABLE IF NOT EXISTS facts (
  id TEXT PRIMARY KEY,
  user TEXT NOT NULL,
  category TEXT,
  text TEXT NOT NULL,
  importance REAL DEFAULT 0.5,
  valid_from REAL NOT NULL,
  valid_to REAL,
  transaction_time REAL NOT NULL,
  last_access REAL,
  last_verified REAL,
  access_count INTEGER DEFAULT 0,
  superseded_by TEXT,
  invalid INTEGER DEFAULT 0,
  archived INTEGER DEFAULT 0,
  embedding_version TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_user_cat ON facts(user, category, archived);
""")

CATEGORY_HALF_LIFE_S = {
    "name": float("inf"),
    "food_preference": 5 * 365 * 86_400,
    "job_title": 4 * 365 * 86_400,
    "address": 3 * 365 * 86_400,
    "preference": 2 * 365 * 86_400,
    "event": 30 * 86_400,
    "fact": 365 * 86_400,
}

EMBEDDING_VERSION = "text-embedding-3-small-2026-05"


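def _eviction_score(row: tuple, now: float, delta_verify: float = 0.3) -> float:
    """The combined eviction score: importance * decay * verification boost.

    Decay here is 0.5 ** (dt / half_life), so each table entry is a literal
    half-life; the prose formula exp(-dt / half_life) would instead treat the
    same constant as an e-folding time (strength falls to ~37%, not 50%).
    """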
    (
        _id, _user, category, _text, importance,
        _vf, _vt, _txn, last_access, last_verified,
        _count, _supersedes, _invalid, _arch, _emb_ver,
    ) = row
    half_life = CATEGORY_HALF_LIFE_S.get(category, 365 * 86_400)
    last = last_access or _txn
    decay = math.pow(0.5, (now - last) / half_life) if half_life != float("inf") else 1.0
    boost = (
        1.0 + delta_verify
        if last_verified and (now - last_verified) < 30 * 86_400
        else 1.0
    )
    return importance * decay * boost


# ---------- Conflict resolver ----------
RESOLVER_SYSTEM = """You are a memory conflict resolver. Given a candidate new fact and the most-similar existing facts in the store, return JSON of the form
{"op": "ADD" | "UPDATE" | "DELETE" | "NOOP", "target_id": "<id or null>", "reason": "<short>"}.

ADD when no existing fact represents the same claim.
UPDATE when an existing fact is closely related and the candidate refines it.
DELETE when an existing fact is contradicted by the candidate (mark old INVALID, write new).
NOOP when the candidate is already represented with no meaningful change.

Be conservative. Prefer ADD over DELETE when uncertain.
"""


def resolve(candidate: dict, neighbors: list[dict]) -> dict:
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        system=RESOLVER_SYSTEM,
        messages=[
            {
                "role": "user",
                "content": json.dumps(
                    {"candidate": candidate, "existing": neighbors}, indent=2
                ),
            }
        ],
    )
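    raw = msg.content[0].text
    # Tolerate models that wrap JSON in prose; if nothing parseable comes
    # back, fall back to the conservative ADD the system prompt asks for.
    start = raw.find("{")
    end = raw.rfind("}") + 1
    try:
        return json.loads(raw[start:end])
    except json.JSONDecodeError:
        return {"op": "ADD", "target_id": None, "reason": "unparseable resolver output"}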


# ---------- Write path ----------
def write_with_resolver(
    user: str,
    text: str,
    category: str,
    importance: float = 0.5,
    valid_from: float | None = None,
) -> dict:
    now = time.time()
    # Compare against None so an explicit timestamp of 0.0 is not overwritten.
    valid_from = now if valid_from is None else valid_from

    # 1. Pull the top-K candidates by similarity (category-scoped).
    hits = facts.query(
        query_texts=[text],
        n_results=8,
        where={"$and": [{"user": user}, {"category": category}]},
    )
    neighbor_ids = hits["ids"][0] if hits["ids"] else []
    placeholders = ",".join("?" * len(neighbor_ids)) if neighbor_ids else "NULL"
    rows = (
        meta.execute(
            f"SELECT * FROM facts WHERE id IN ({placeholders}) AND invalid = 0 "
            f"AND archived = 0",
            neighbor_ids,
        ).fetchall()
        if neighbor_ids
        else []
    )
    neighbors = [{"id": r[0], "text": r[3], "valid_from": r[5]} for r in rows]

    # 2. If no neighbors, ADD without an LLM call.
    if not neighbors:
        return _do_add(user, text, category, importance, valid_from, now)

    # 3. Resolver call.
    decision = resolve(
        {"text": text, "category": category, "valid_from": valid_from},
        neighbors,
    )
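    op = decision.get("op", "ADD")
    target = decision.get("target_id")
    # Only accept a target the resolver was actually shown. With an unknown
    # or missing ID, UPDATE and DELETE fall through to the default ADD.
    if target not in {n["id"] for n in neighbors}:
        target = None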

    if op == "NOOP":
        # Touch the target's last_verified — the fact was re-confirmed.
        if target:
            meta.execute(
                "UPDATE facts SET last_verified = ? WHERE id = ?", (now, target)
            )
            meta.commit()
        return {"op": "NOOP", "id": target}

    if op == "UPDATE" and target:
        meta.execute(
            "UPDATE facts SET text = ?, importance = MAX(importance, ?), "
            "last_verified = ? WHERE id = ?",
            (text, importance, now, target),
        )
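        # update() changes only the fields passed in, so the stored
        # user/category metadata used by the where-filter is left as-is.
        facts.update(ids=[target], documents=[text])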
        meta.commit()
        return {"op": "UPDATE", "id": target}

    if op == "DELETE" and target:
        new_id = _do_add(user, text, category, importance, valid_from, now)["id"]
        meta.execute(
            "UPDATE facts SET invalid = 1, valid_to = ?, superseded_by = ? "
            "WHERE id = ?",
            (now, new_id, target),
        )
        meta.commit()
        return {"op": "DELETE+ADD", "old_id": target, "new_id": new_id}

    # Default to ADD.
    return _do_add(user, text, category, importance, valid_from, now)


def _do_add(
    user: str, text: str, category: str, importance: float,
    valid_from: float, now: float,
) -> dict:
    fid = str(uuid.uuid4())
    facts.add(ids=[fid], documents=[text], metadatas=[{"user": user, "category": category}])
    meta.execute(
        "INSERT INTO facts (id, user, category, text, importance, valid_from, "
        "transaction_time, last_access, last_verified, embedding_version) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (fid, user, category, text, importance, valid_from, now, now, now, EMBEDDING_VERSION),
    )
    meta.commit()
    return {"op": "ADD", "id": fid}


# ---------- Active forgetting ----------
def forget_pass(user: str, threshold: float = 0.05, hard_cap: int = 5000) -> int:
    """Archive facts whose eviction score is below threshold, or whose count
    exceeds the hard cap (keeping the highest-scoring up to cap)."""
    now = time.time()
    rows = meta.execute(
        "SELECT * FROM facts WHERE user = ? AND archived = 0 AND invalid = 0",
        (user,),
    ).fetchall()
    if not rows:
        return 0

    scored = [(r, _eviction_score(r, now)) for r in rows]
    scored.sort(key=lambda x: x[1])
    archived_count = 0

    # Threshold-based: archive anything below the cutoff.
    for row, score in scored:
        if score >= threshold:
            break
        meta.execute("UPDATE facts SET archived = 1 WHERE id = ?", (row[0],))
        archived_count += 1

    # Capacity-based: enforce hard cap on remaining.
    remaining = [r for r, s in scored if s >= threshold]
    if len(remaining) > hard_cap:
        cut = remaining[: len(remaining) - hard_cap]
        for row in cut:
            meta.execute("UPDATE facts SET archived = 1 WHERE id = ?", (row[0],))
            archived_count += 1

    meta.commit()
    return archived_count


if __name__ == "__main__":
    # User starts at Acme on Postgres in March.
    write_with_resolver("alice", "User works at Acme on the data platform team.",
                        category="job_title", importance=0.8)
    write_with_resolver("alice", "User's primary database is Postgres.",
                        category="preference", importance=0.6)

    # April: user pivots — contradiction case.
    print(write_with_resolver(
        "alice", "User switched to Cassandra; team moved off Postgres last month.",
        category="preference", importance=0.7,
    ))

    # NOOP case: user re-states the same fact.
    print(write_with_resolver(
        "alice", "User works at Acme.",
        category="job_title", importance=0.7,
    ))

    # Forget pass — typically runs on a sleep-time schedule.
    print(f"archived: {forget_pass('alice')}")

Five things to notice. First, the resolver is opt-in per write, not per read: the LLM call lands on the write path, where it’s bounded (one per write against ≤8 candidates), not on the read path, where it would compound across every retrieval. Second, DELETE doesn’t drop the row: the old fact gets invalid = 1, valid_to = now, and a superseded_by back-pointer, and the row stays in the store for audit and historical-belief queries. Third, NOOP updates last_verified: even though no semantic change happened, the re-statement is evidence the fact is still current, and the verification boost rewards it on the next read. Fourth, the eviction score is computed lazily, not cached: eviction-relevant fields (importance, last_access, last_verified) change too often to maintain a precomputed score, and the scan is cheap enough at production sizes (linear in the user’s fact count). Fifth, archiving is reversible: the row stays in the store with archived = 1, and an explicit cold-tier query can recover it; this is the difference between forgetting (soft state) and unlearning (hard deletion with audit trail).

Code: TypeScript — Drift-Adapter affine map between embedding versions

The companion sketch on the embedding-drift side: an affine map from one embedding model’s space to another, trained on a small sample of paired old/new embeddings. This is the Drift-Adapter approach in its simplest form (orthogonal Procrustes for an isometric map; the paper also covers low-rank affine and a residual MLP). The snippet leaves the SVD step as a placeholder. Install: pnpm add openai mathjs.

typescript

// pnpm add openai mathjs
import OpenAI from "openai";
import { matrix, multiply, transpose, type Matrix } from "mathjs";

const openai = new OpenAI(); // OPENAI_API_KEY in env

const OLD_MODEL = "text-embedding-3-small";
const NEW_MODEL = "text-embedding-3-large";

async function embed(model: string, texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({ model, input: texts });
  return res.data.map((d) => d.embedding);
}

/**
 * Train an orthogonal Procrustes map from `oldEmbeddings` to `newEmbeddings`.
 * Returns the rotation matrix W such that newEmbeddings ≈ oldEmbeddings @ W.
 * Procrustes is the closed-form solution of min ||X W - Y|| s.t. W^T W = I.
 */
function fitProcrustes(oldVecs: number[][], newVecs: number[][]): Matrix {
  // Conceptual shape only: this computes the cross-covariance X^T Y and stops
  // short of the SVD step. scipy.linalg.orthogonal_procrustes is the canonical
  // reference implementation. The full recipe is X^T Y -> SVD -> W = U V^T.
  const X = matrix(oldVecs);
  const Y = matrix(newVecs);
  const XtY = multiply(transpose(X), Y);
  // Plug in an SVD library here (e.g. svd-js) and return U @ V^T.
  // The full Drift-Adapter paper also covers a low-rank affine (Y = X A + b)
  // fit with regularization, which is what the production version uses.
  // NOTE: XtY itself is NOT orthogonal, so using the transpose as the inverse
  // in queryWithAdapter is only valid once the SVD step is filled in.
  return XtY as Matrix; // placeholder, see note above
}

async function trainAdapter(sampleTexts: string[]): Promise<Matrix> {
  // 1. Embed the same sample with both models.
  const oldVecs = await embed(OLD_MODEL, sampleTexts);
  const newVecs = await embed(NEW_MODEL, sampleTexts);
  // 2. Fit the map: new-space query @ W^{-1} ≈ old-space vector.
  return fitProcrustes(oldVecs, newVecs);
}

async function queryWithAdapter(
  query: string,
  adapter: Matrix,
  oldIndex: { search: (vec: number[], k: number) => Promise<unknown[]> },
  k: number = 10,
): Promise<unknown[]> {
  // 1. Encode the query with the new model.
  const newQuery = (await embed(NEW_MODEL, [query]))[0];
  // 2. Project into the old space (apply the inverse map; for orthogonal
  //    Procrustes, the inverse is the transpose).
  const projected = multiply(matrix([newQuery]), transpose(adapter));
  // 3. Search against the existing old-vector index.
  return oldIndex.search(projected.toArray()[0] as number[], k);
}

// Training cadence: re-fit the adapter whenever the new model is updated or
// when the validation recall against a held-out set drops below threshold.
async function validate(
  heldOut: { query: string; relevantIds: string[] }[],
  adapter: Matrix,
  oldIndex: { search: (vec: number[], k: number) => Promise<{ id: string }[]> },
): Promise<{ recallAt10: number }> {
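  let total = 0;
  for (const { query, relevantIds } of heldOut) {
    const results = await queryWithAdapter(query, adapter, oldIndex, 10);
    const ids = new Set((results as { id: string }[]).map((r) => r.id));
    // Per-query recall: fraction of the known-relevant IDs found in the top 10
    // (counting "any hit" instead would measure hit rate, not recall).
    const found = relevantIds.filter((id) => ids.has(id)).length;
    total += relevantIds.length > 0 ? found / relevantIds.length : 0;
  }
  // Guard against an empty held-out set, which would otherwise yield NaN.
  return { recallAt10: heldOut.length > 0 ? total / heldOut.length : 0 };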
}

The fit itself is the closed-form Procrustes solution — encode a few thousand documents with both v1 and v2, compute X^T Y, SVD-decompose to U Σ V^T, return W = U V^T. Production fits use a few thousand to a few tens of thousands of paired embeddings; the Drift-Adapter paper reports the map being stable with ~5,000 pairs and the marginal recall improvement past 10,000 pairs being well below 1%. The cost is two rounds of embedding the same training corpus (a one-time bill of ~$0.20 for 5,000 documents at small-model rates), plus the SVD. The validation runs the adapter against a held-out query set with known relevant doc IDs and measures recall@10; the alias swap happens when the adapter’s recall recovers to within 1-2% of the dual-encoder baseline.
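For reference, the closed-form fit described above can be sketched in a few lines of Python. This is a minimal sketch, assuming NumPy is available and that both models emit vectors of the same dimensionality (the equal-dimension case only). The names `fit_procrustes` and `project_query_to_old_space` are illustrative, and the synthetic rotation stands in for a real pair of embedding models.

```python
import numpy as np


def fit_procrustes(old_vecs: np.ndarray, new_vecs: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing ||old_vecs @ W - new_vecs||_F."""
    if old_vecs.shape != new_vecs.shape:
        raise ValueError("sketch assumes paired rows and equal dimensionality")
    # SVD of the cross-covariance X^T Y = U S V^T; the solution is W = U V^T.
    u, _, vt = np.linalg.svd(old_vecs.T @ new_vecs)
    return u @ vt


def project_query_to_old_space(new_query: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Map a new-model query into old-model space (orthogonal W: inverse = transpose)."""
    return new_query @ w.T


# Synthetic check: pretend the "new model" is the "old model" under a known rotation.
rng = np.random.default_rng(0)
old = rng.normal(size=(200, 8))                      # 200 paired samples, dim 8
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
new = old @ rotation

w = fit_procrustes(old, new)
recovered = project_query_to_old_space(new[0], w)
print("W orthogonal:", np.allclose(w.T @ w, np.eye(8)))
print("query recovered:", np.allclose(recovered, old[0]))
```

In a real migration, `old` and `new` would be the paired embeddings of the training sample. The fit is exact here only because the synthetic data is a pure rotation; real model pairs leave a residual, which is what the held-out validation measures.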

The interesting failure mode the snippet doesn’t cover: if the new model has a larger dimensionality than the old (3-small → 3-large is 1536 → 3072), the affine map is from ℝ^{3072} → ℝ^{1536}, which is a projection (information is lost). The Drift-Adapter paper shows the projection recovers most of the recall in practice — the high-dimensional embedding has a lot of redundant capacity — but the residual gap is larger than the equal-dimensional case (1-3% rather than <1%). For a production migration where the new model is dimensionally larger, the dual-index path is often worth the cost.

Trade-offs, failure modes, and gotchas

The “two contradictions disagree” case. The resolver decides DELETE-and-replace on a write. A subsequent write contradicts the new fact, and the resolver again decides DELETE-and-replace. After two passes, the store has three facts in the supersession chain (A → B → C), all in the store with invalid and superseded_by stamps. Most retrieval paths follow the chain to the current head correctly. The trap is audit queries that walk the chain expecting a tree but find a chain of chains — production tooling that visualizes the supersession structure has to handle arbitrary-depth lineage, and the consolidator can flatten the chain to A → C (with B retained as a leaf for audit) when the intermediate is no longer being referenced.
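A small sketch of the arbitrary-depth walk and the flattening step, assuming the supersession pointers have been loaded into a plain dict of `id -> superseded_by` (the function names and the in-memory shape are illustrative, not the store's actual API):

```python
from typing import Optional


def current_head(fact_id: str, superseded_by: dict[str, Optional[str]]) -> str:
    """Follow superseded_by pointers from any fact to the live head of its chain."""
    seen = {fact_id}
    nxt = superseded_by.get(fact_id)
    while nxt is not None:
        if nxt in seen:
            # A cycle means the lineage is corrupt; fail loudly instead of looping.
            raise ValueError(f"cycle in supersession chain at {nxt!r}")
        seen.add(nxt)
        fact_id = nxt
        nxt = superseded_by.get(fact_id)
    return fact_id


def flatten(superseded_by: dict[str, Optional[str]]) -> dict[str, Optional[str]]:
    """Point every superseded fact directly at its current head; heads stay None."""
    return {
        fid: None if nxt is None else current_head(fid, superseded_by)
        for fid, nxt in superseded_by.items()
    }


chain = {"A": "B", "B": "C", "C": None}   # A -> B -> C
flat = flatten(chain)
print(current_head("A", chain))  # C
print(flat)                      # {'A': 'C', 'B': 'C', 'C': None}
```

After flattening, A points straight at C, and B is still in the store (pointing at C) but nothing points at it, which is the "retained as a leaf" shape.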

The category-skew problem. Half-life-per-category is the right shape, but real workloads have category distributions that drift over time. A consumer assistant in 2026 may have 70% preference facts and 5% event facts; the same product in 2027 may have 50% event facts because the use case shifted toward task tracking. The fixed half-lives optimized in 2026 over-evict events in 2027. The mitigation is online recalibration — a background pass periodically refits the per-category half-lives against the empirical valid_to - valid_from distribution, and surfaces the new values for human review before promotion. Without it, the policy drifts silently.
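One way the recalibration pass could look, as a hedged sketch: it assumes fact lifetimes within a category are roughly exponential, in which case the median observed lifetime equals the half-life. It only produces proposals for review and never writes them. `propose_half_lives`, `min_samples`, and the row shape are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Iterable, Optional

DAY = 86_400.0


def propose_half_lives(
    rows: Iterable[tuple[str, float, Optional[float]]],
    current: dict[str, float],
    min_samples: int = 20,
) -> dict[str, dict]:
    """rows are (category, valid_from, valid_to); open facts have valid_to=None.

    Returns per-category proposals for a human to review. Nothing is promoted here.
    """
    lifetimes: dict[str, list[float]] = defaultdict(list)
    for category, valid_from, valid_to in rows:
        # Only closed facts have an observed lifetime. Skipping open facts
        # biases the estimate short, which a reviewer should keep in mind.
        if valid_to is not None and valid_to > valid_from:
            lifetimes[category].append(valid_to - valid_from)

    proposals = {}
    for category, spans in lifetimes.items():
        if len(spans) < min_samples:
            continue  # too few closed facts to trust a refit
        proposals[category] = {
            "current": current.get(category),
            "proposed": median(spans),
            "samples": len(spans),
        }
    return proposals


rows = [("event", 0.0, d * DAY) for d in range(1, 31)]        # 30 closed events
rows += [("preference", 0.0, 90 * DAY)] * 5                   # below min_samples
rows += [("event", 0.0, None)]                                # still valid, skipped
proposals = propose_half_lives(rows, current={"event": 7 * DAY})
print(proposals)
```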

The cold-tier scan problem. Archived facts are excluded from the default retrieval, but the cold tier still has to be query-able for explicit “do you remember when…” patterns. A naive implementation runs the cold-tier query against the same ANN index; the index size never shrinks; the eviction policy has saved storage but not retrieval cost. The mitigation is a separate cold-tier index — archived facts get moved to a second ANN index (or a sequential-scan store, for very cold facts) that’s only queried on explicit requests. The pattern is the same as the hierarchical-memory tier-2-versus-tier-3 split applied to the storage layer.

The Drift-Adapter dimensional-mismatch trap. When v2 has a larger embedding dimensionality than v1, the adapter is a projection and the residual recall gap is larger (1-3% vs. <1% for equal-dim cases). Workloads that depend on precise top-K ordering — e.g., a reranker downstream of the retrieval — see the gap concentrated in the long tail rather than the head, so the gap can be deceptively invisible in recall@10 but visible in NDCG over the full candidate set. The mitigation is to validate against the metric the downstream actually cares about, not just recall@10; the dual-index migration is the right answer when NDCG@K-full matters more than the migration cost.
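To make the recall@10-versus-NDCG point concrete, here is a small self-contained sketch with binary relevance. The two rankings are synthetic: both keep the same relevant documents in the top 10, but the "adapted" ranking pushes the remaining relevant documents deeper into the tail. The helper names are illustrative.

```python
import math


def recall_at_k(ranked_ids: list[str], relevant: list[str], k: int) -> float:
    """Fraction of relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked_ids: list[str], relevant: list[str], k: int) -> float:
    """Binary-relevance NDCG: rewards relevant IDs more the higher they rank."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]) if d in rel)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal if ideal else 0.0


def ranking(positions: dict[str, int], size: int = 50) -> list[str]:
    """Build a ranked list of `size` IDs with given docs at 0-based positions."""
    out = [f"filler{i}" for i in range(size)]
    for doc_id, pos in positions.items():
        out[pos] = doc_id
    return out


relevant = ["d1", "d2", "d3", "d4"]
baseline = ranking({"d1": 0, "d2": 1, "d3": 11, "d4": 14})
adapted = ranking({"d1": 0, "d2": 1, "d3": 29, "d4": 39})   # tail pushed deeper

print(recall_at_k(baseline, relevant, 10), recall_at_k(adapted, relevant, 10))
print(ndcg_at_k(baseline, relevant, 50), ndcg_at_k(adapted, relevant, 50))
```

Recall@10 is identical for the two rankings, while NDCG over the full 50-item candidate set is lower for the adapted one. That is the pattern a downstream reranker would feel and a recall@10 dashboard would miss.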

The “we re-embedded once and forgot the version stamp” retrofit. A team re-embeds the entire store with v2 but doesn’t stamp the new embedding version on each row. Six months later, v3 lands; the team writes a Drift-Adapter from v2 to v3; the adapter is fit on the assumption that the store is fully v2. But a small fraction of the store is still v1 — backfill jobs that never finished, edge cases that the migration script missed, rows ingested from a stale snapshot. The adapter projects v3 queries into v2-space, and the v1 vectors in the store are still in v1-space, and the recall is silently lower than the validation set suggested. The mitigation is to stamp the embedding version at write time on every row, and to make the retrofit pipeline verify (not just compute) that every row’s version matches what the current adapter expects. The pattern is the schema-version stamp from migration-tooling best-practices, applied to embeddings.
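A sketch of the "verify, not just compute" check, using SQLite and the `embedding_version` column from the store above. The in-memory table here is a stripped-down stand-in for the real `facts` schema, and `version_census` and `assert_store_matches` are hypothetical helper names.

```python
import sqlite3


def version_census(conn: sqlite3.Connection, expected: str) -> tuple[dict, dict]:
    """Count rows per embedding version; return (all versions, the stragglers)."""
    rows = conn.execute(
        "SELECT COALESCE(embedding_version, 'UNSTAMPED'), COUNT(*) "
        "FROM facts GROUP BY COALESCE(embedding_version, 'UNSTAMPED')"
    ).fetchall()
    census = dict(rows)
    stragglers = {v: n for v, n in census.items() if v != expected}
    return census, stragglers


def assert_store_matches(conn: sqlite3.Connection, expected: str) -> None:
    """Refuse to fit or deploy an adapter while any row is on another version."""
    _, stragglers = version_census(conn, expected)
    if stragglers:
        raise RuntimeError(f"rows not on {expected!r}: {stragglers}")


# Stand-in store: three rows on v2, one left on v1, one never stamped.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id TEXT PRIMARY KEY, embedding_version TEXT)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("a", "v2"), ("b", "v2"), ("c", "v2"), ("d", "v1"), ("e", None)],
)
census, stragglers = version_census(conn, "v2")
print(census)       # counts per version, including 'UNSTAMPED'
print(stragglers)   # everything the v2 -> v3 adapter would silently mishandle
```

Running `assert_store_matches(conn, "v2")` as a gate before fitting the next adapter turns the silent recall loss into a loud failure at migration time.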

The user-asserted-deletion-versus-supersession ambiguity. A user says “I never said that — please forget I ever told you about Y.” The agent should treat this as a hard deletion (the user is exercising the right to be forgotten on a specific memory), not as a supersession. The trap is the resolver treating it as a contradiction and writing a new fact “user denies Y” alongside the original — which is the opposite of what the user wanted. The mitigation is a deletion-intent classifier in front of the resolver: when the user explicitly invokes deletion language, route to the hard-delete pipeline rather than the supersession pipeline. The classifier is a small LLM call and has a separable failure mode (it’s a different question from “is this a contradiction”); production agents that conflate the two paths surface “I noted that you don’t recall saying X” responses that read as gaslighting. The cross-session identity article generalizes this to scoped hard-deletion of user profiles (“stop remembering anything about my job”) and the audit-log discipline that compliance demonstrations actually require; the memory privacy and multi-tenancy article is the full seven-step deletion pipeline (cache invalidation, derived-artifact rebuild, verification query, attestation) that an auditor actually checks for.

The verification-boost overweighting. A fact that’s been re-confirmed in the last 30 days gets freshness *= 1.3. A fact the user mentions every day for a week gets the boost stacked across multiple recent verifications — the implementation either over-credits (multiplies the boost per verification, blowing up the score) or correctly credits once (single threshold check on the most recent verification timestamp). The right shape is the latter: a binary “any verification in the recent window” rather than a sum across verifications, because the cognitive-psychology analogue (a memory reinforced by repeated retrieval) saturates rather than compounds linearly. Implementations that don’t enforce the saturation produce eviction-resistant “favorite memories” that never decay.
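The difference between the two shapes is easy to see side by side. A minimal sketch, using the 30-day window and the 1.3 multiplier from the text; the function names are illustrative, and the compounding version is shown only as the anti-pattern:

```python
DAY = 86_400.0
VERIFY_WINDOW_S = 30 * DAY
VERIFY_BOOST = 1.3


def saturating_boost(last_verified: float, now: float) -> float:
    """Binary check on the most recent verification: boost once or not at all."""
    return VERIFY_BOOST if 0 <= now - last_verified <= VERIFY_WINDOW_S else 1.0


def compounding_boost(verified_at: list[float], now: float) -> float:
    """Anti-pattern: multiplies the boost once per verification in the window."""
    recent = sum(1 for t in verified_at if 0 <= now - t <= VERIFY_WINDOW_S)
    return VERIFY_BOOST ** recent


now = 100 * DAY
daily_mentions = [now - i * DAY for i in range(7)]   # mentioned every day for a week

print(saturating_boost(max(daily_mentions), now))    # 1.3
print(compounding_boost(daily_mentions, now))        # about 6.27
print(saturating_boost(now - 45 * DAY, now))         # 1.0, outside the window
```

Because the saturating version needs only the most recent timestamp, it works directly off the single `last_verified` column and does not require storing a verification history.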

The contradiction-cascade in reflection. A reflection is built on three source episodes. One of the sources is contradicted by a later write. The reflection is now built on stale data, but the contradiction propagation only walked one hop — the source got marked INVALID, the reflection did not. The reflection continues to be retrieved and quoted. The mitigation is the depth-bounded walk on contradiction propagation — when a source is marked INVALID, walk the provenance graph (depth 2-3) and mark every dependent claim as needing revalidation. The revalidation is itself an LLM call (re-derive the claim from the remaining valid sources), and the consolidator is the right place to run it.
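The depth-bounded walk is a breadth-first search over "is derived from" edges, read in reverse. A sketch, assuming the provenance graph is available as a dict from each source ID to the claims built on it (`dependents` and `mark_for_revalidation` are illustrative names, not the store's API):

```python
from collections import deque


def mark_for_revalidation(
    invalid_id: str,
    dependents: dict[str, list[str]],
    max_depth: int = 3,
) -> dict[str, int]:
    """Return {claim_id: hops from the invalidated source}, up to max_depth hops."""
    flagged: dict[str, int] = {}
    queue = deque([(invalid_id, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # bound reached: do not expand this node's dependents
        for child in dependents.get(node, ()):
            if child != invalid_id and child not in flagged:
                flagged[child] = depth + 1
                queue.append((child, depth + 1))
    return flagged


# ep1 feeds refl1, which feeds refl2, and so on down a four-level lineage.
dependents = {
    "ep1": ["refl1"],
    "refl1": ["refl2"],
    "refl2": ["refl3"],
    "refl3": ["refl4"],
}
print(mark_for_revalidation("ep1", dependents, max_depth=3))
# {'refl1': 1, 'refl2': 2, 'refl3': 3}
```

The flagged IDs are what the consolidator would pick up for re-derivation; `refl4` sits beyond the depth bound and is left alone until a later pass reaches it.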

The embedding-version-on-cache-key problem. Most production retrieval caches the embedding for a query alongside the query string. After a model upgrade, the cache holds v1 embeddings but the index expects v2; cache hits return wrong results. The mitigation is to include the embedding-model version in the cache key (hash(query) + ":" + model_version), so the cache invalidates automatically on the model swap. The trap is that cache-key changes invalidate every existing entry, which spikes cache misses on the day of the migration. The mitigation is to pre-warm the new cache before the alias swap.
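A sketch of the versioned key and the pre-warm step. The cache is a plain dict and `embed` is an injected callable standing in for the real embedding client; both are assumptions for illustration, as are the function names.

```python
import hashlib
from typing import Callable


def cache_key(query: str, model_version: str) -> str:
    """Key on both the query and the embedding-model version."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"{digest}:{model_version}"


def prewarm(
    cache: dict[str, list[float]],
    hot_queries: list[str],
    model_version: str,
    embed: Callable[[str], list[float]],
) -> int:
    """Populate entries for the new version before the alias swap; return count added."""
    added = 0
    for query in hot_queries:
        key = cache_key(query, model_version)
        if key not in cache:
            cache[key] = embed(query)
            added += 1
    return added


def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding call, used only to make the sketch runnable.
    return [float(len(text))]


cache: dict[str, list[float]] = {cache_key("what db do I use?", "v1"): [0.0]}
added = prewarm(cache, ["what db do I use?", "where do I work?"], "v2", fake_embed)
print(added, len(cache))  # 2 new v2 entries alongside the old v1 entry
```

The v1 entry stays in place until it expires, so a rollback still hits a warm cache. The v2 entries are already there when the alias flips.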

What to read next

Temporal Reasoning and Memory Provenance — the immediate predecessor: the bi-temporal substrate this article’s contradiction resolver and freshness scorer both depend on. The valid_to field that marks a contradicted fact, the last_verified clock that drives the verification boost, and the provenance graph that contradiction propagation walks.
Memory Retrieval Policies: Recency, Relevance, Importance — the read-side companion to this article’s write-side resolver and forgetting policy. The eviction-score formula here is the storage-side mirror of the retrieval-score formula there: the same Ebbinghaus decay, the same importance weight, the same verification boost, applied to deciding what stays in the store rather than what surfaces in the candidate set.
Memory Write Policies: What’s Worth Remembering — the upstream layer that decides whether a candidate write becomes a fact at all. The conflict resolver in this article runs after the write policy admits the candidate; if the write policy is well-calibrated, the resolver mostly sees clean ADD operations and the contradiction-handling load is bounded.
Sleep-Time Compute and Memory Consolidation — the regime where the background contradiction sweep and the forget pass both run. The hot-path resolver catches the per-write contradictions; the consolidator catches what the per-write retrieval missed, runs the forgetting policy at scale, and flattens supersession chains into summary nodes.