jatin.blog ~ $
$ cat ai-engineering/episode-segmentation.md

Episode Segmentation and Salience Scoring

Episode segmentation and salience scoring: prediction-error and topic-shift boundaries, anchored 1-10 importance, the event-sourcing aggregate parallel.

Jatin Bansal@blog:~/ai-engineering$ open episode-segmentation

A coding agent has been running for an hour. The session contains: forty turns debugging an authentication bug, a sharp pivot to a question about deployment, twelve turns on the deploy pivot, then a return to a third unrelated topic — what to log when a retry fails. Three hours later the user asks “what did we decide about the auth bug?” and the long-term episodic store needs to surface the right slice. If the store wrote one row per turn the retrieval is fuzzy — every keyword hit returns three loosely-related fragments and the rerank can’t tell which were on-topic for which sub-conversation. If the store wrote one row for the whole hour the retrieval is precise but useless — the auth-bug answer is buried inside fifty-two turns of mixed content and the model has to re-read the entire transcript to find it. Neither extreme works, and the gap between them is the layer this article is about. Where does one episode end and the next begin? And once segmented, how much should each episode weigh on the read-side rerank? Those two questions — segmentation and salience — are the upstream pair of decisions every memory write policy depends on, and the layer where most hand-rolled memory systems silently fail.

Opening bridge

Yesterday’s memory write policies piece named the four-stage write pipeline — triage, distill, dedupe, persist — and treated the unit of admission as a given input to the gate. The long-term memory piece had already named the unit-of-recall problem in passing: per-message, per-exchange, per-session, or extracted-fact granularity, and the trade-off between fidelity and precision at each level. What both pieces deferred is the actual mechanics of deciding the unit — the segmentation algorithm — and the actual mechanics of scoring it once written — the salience function. Today’s article is the deep dive on those two upstream operations. Treat it as the prerequisite that makes the rest of the write/read stack work: segmentation decides what’s in a row, salience decides how much that row weighs on the rerank. Without anchored answers to both, every downstream tuning knob on the memory subsystem is tuning a moving baseline.

Definition

Episode segmentation is the operation that partitions a continuous stream of agent observations (user turns, assistant turns, tool calls, tool results) into discrete episodes — bounded, indexable units that each represent a coherent sub-task or sub-topic. The output is a sequence of {start_ts, end_ts, content, type} records, where each record’s content is internally coherent and the boundaries between records correspond to meaningful shifts in topic, goal, or context. Three properties separate a segmentation from arbitrary chunking. First, it is content-aware — the boundaries are chosen because something changed (topic, user goal, system state), not because a fixed window expired. Second, it is hierarchical-tolerant — sub-episodes nest inside larger episodes (the auth-bug episode contains a debugger-output sub-episode contains a stack-trace observation), and the segmentation algorithm picks the right level for its downstream consumer. Third, it is non-destructive — the raw turn-by-turn journal is preserved upstream; segmentation is a view over the journal, not a replacement for it.

Salience scoring is the operation that assigns each episode a single scalar — typically in [0, 1] or [1, 10] — representing how load-bearing the episode is on future recall. A high-salience episode is one that meaningfully changes what the agent knows about the user, the task, or the world; a low-salience episode is conversational filler, a small clarification, a routine acknowledgement. The score is consumed by the read-side rerank (the α·recency + β·importance + γ·similarity formulation introduced in the long-term memory piece), by the reflection pass (whose importance-sum threshold trigger consumes the same salience signal), and by maintenance passes (demotion, forgetting, archival). The single most important property of a useful salience score is that it is anchored — the scale’s endpoints are explicitly pinned, so a 7 means the same thing across episodes, sessions, and users.

What these two are not. Segmentation is not chunking — chunking strategies operate on static documents at index time with no notion of agent state; segmentation operates on streaming dialogue with explicit topic, goal, and turn-type signals available. Salience is not confidence — confidence is “how sure am I this fact is true,” salience is “how much will this fact matter to future reads.” Both segmentation and salience are upstream of the admission-control gate: you have to know what unit you’re admitting and how it will be weighted before the admission decision is even well-posed.

Intuition

The mental model that pays off is event-driven log compaction with a learned weight per event. The continuous turn-by-turn journal is the raw event stream; segmentation runs as a stateful pass over the stream, emitting a coarser event at each detected boundary; salience scores the coarser event so that downstream consumers can prioritize. The shape is exactly Kafka log compaction with an importance field, or — closer to the cognitive parallel — the way humans don’t remember every second of a day, they remember it as a sequence of episodes (the morning meeting, the lunch conversation, the afternoon debugging session), and the brain weights each episode by how much it mattered at the time.

Two design questions force themselves on every implementation. The first is what triggers a boundary? The five practical signals, in roughly increasing sophistication: a fixed-window timeout (every N turns or every M minutes), a semantic-shift threshold (cosine distance between consecutive turn embeddings exceeds a cutoff), a prediction-error signal (the next turn doesn’t fit the model’s running expectation), a structural turn-type marker (the user types a question that doesn’t follow from the prior thread; a tool call returns an error that opens a new sub-task), or an explicit agent-emitted marker (the agent decides “this is a new topic” and emits a <episode_boundary> token). Each has a different cost profile and a different failure mode. The mature 2026 production answer is a hybrid — the cheap structural signals are the fast path, the semantic-shift signal is the safety net, the agent-emitted markers are the override for the cases the structural signals miss.

The second question is at what level? Episodes are naturally hierarchical: a turn (one user message), an exchange (one user + one assistant turn), a task-sub-block (auth-bug-debugging), a session (the hour-long coding conversation). Each level is a legitimate unit for some workload, and a robust segmentation pass annotates the journal at multiple levels simultaneously — the retrieval pass can then choose its level based on the query shape. The same way event segmentation theory in cognitive neuroscience shows that humans segment activity at multiple timescales simultaneously, the agent’s segmentation should too.

Cognitive grounding — the parallel from event segmentation theory

The neuroscience here is unusually relevant, because the architecture the brain uses to segment continuous experience into discrete episodes is the same architecture that has converged in production agent memory. Zacks and Swallow’s event segmentation theory (Zacks et al., 2007; multiple follow-ups through 2024) argues that human perception is continuously running a predictive model of the immediate near-future, and that a perceived event boundary corresponds to a moment of high prediction error — the model’s expectation diverges sharply from what actually happens. Activity in posterior temporal and parietal cortex spikes transiently at these boundaries; the hippocampus uses the boundary as the natural unit of encoding for episodic memory; subsequent recall of the episode is conditional on the boundary having been correctly detected at encoding time.

Two implications port directly. First, prediction-error is a load-bearing segmentation signal — when the next turn surprises a running expectation, that’s a boundary worth marking. In agent terms: maintain a small running summary of “what is this conversation about” (the working-memory scratchpad is the natural place to hold it), and when a new turn semantically diverges from that summary above a threshold, emit a boundary. The recent cognitive-neuroscience work on prediction-error-driven segmentation makes this concrete: error-driven boundaries are associated with pattern shifts in ventrolateral prefrontal areas (the brain’s working-memory hub), while uncertainty-driven boundaries are linked to shifts in parietal attention networks. The two-mode pattern shows up in production agents too — a topic shift is one signal, a task-failure-into-recovery is a different signal, and a robust segmentor distinguishes them.

Second, memory is encoded around boundaries, not within them. The hippocampus shows ripple-like reactivation activity at event offsets, and this offset-time activity predicts subsequent recall of the just-finished event (Sols et al. 2017, Reagh & Ranganath 2024). The agent-architecture parallel: the right time to run the memory write policy’s extract stage is at the boundary, not at every turn. The boundary is where the model has the full context of the just-completed sub-task and can write the highest-quality distilled fact; running extraction at every turn produces redundant micro-facts that have to be deduped later. Boundary-triggered extraction is the closest agent-architecture analog of the brain’s offset-locked consolidation, and it’s the pattern A-MEM and Letta have converged on.

The distributed-systems parallel — event-sourcing aggregate boundaries

The cleanest distributed-systems parallel is the aggregate boundary in event-sourced systems. In Vaughn Vernon’s framing, an aggregate is the unit of transactional consistency — every event in a single aggregate is committed atomically, and consistency across aggregates is eventual. The aggregate’s boundary is the line on which the system decides “this is one consistent thing.” Get the aggregate boundary wrong — too large, and contention kills throughput; too small, and consistency invariants leak across aggregates — and the system has architectural problems that no tuning of the storage layer can fix.

Agent episodes recapitulate this exactly. An episode is the transactional consistency unit of memory: the things inside one episode are committed and reasoned about together (the auth-bug debug session, all twelve turns of it, retrieves as one unit); the things in different episodes are reasoned about separately, even when they share keywords (the auth-bug session and the deploy-pivot session both mention “production” but they should not blur together in the rerank). An episode that’s too large is the all-day-conversation row that buries the auth-bug answer in fifty-two unrelated turns. An episode that’s too small is the per-turn row that splits the auth-bug discussion across forty-eight rows and forces the rerank to stitch them back together — usually badly. The aggregate-boundary intuition from DDD ports over without modification: the segmentation algorithm is choosing the consistency boundary of the memory subsystem, and the same heuristics (“one aggregate per use case, one transaction per aggregate”) tell you when the boundary is right.

Two further parallels worth naming. Event-sourcing’s append-only journal is the per-turn raw log; the read model is the segmented + scored episode store that retrieval queries against. The pattern in production is the same: write everything to the journal, project the journal into one or more read models, query the read models. The segmentation pass is a projection; the salience score is a column on the projection; the rerank reads off the projection, never off the raw journal. Kafka’s log compaction is the maintenance pass that goes back through the raw journal and replaces sequences of fine-grained events with the post-segmentation coarsened versions — exactly the role of the sleep-time consolidation pass the rest of the memory subsystem runs.

Five segmentation signals

The five practical signals, ordered from cheapest to most-sophisticated. Production systems usually hybridize three or four of them; the goal of knowing them individually is to know which signal is failing when the segmentor mis-fires.

1 — Fixed-window timeout (the baseline). Every N turns (typically 10-20) or every M minutes of clock-time, emit a boundary. Cheapest possible policy; zero model calls; trivially implementable. The failure mode is the mid-task split — a single coherent sub-task that runs longer than N turns gets cut in half, and the second half loses the context of the first half. Worth shipping as the floor of any segmentation system; it guarantees segments don’t grow unboundedly even when other signals fail.

2 — Semantic-shift threshold. Embed each turn (or each user+assistant exchange) and compute the cosine distance to a running representation of the current episode — typically the mean embedding of the last K turns. When the distance exceeds a threshold (commonly 0.3-0.5 with Sentence-Transformers-style embeddings), emit a boundary. Costs one embedding call per turn (~$0.0001 at current rates) and a vector arithmetic op. The failure mode is the gradual drift — a conversation that slowly transitions from one topic to another over five turns never triggers the threshold on any single turn, and the boundary lands in the wrong place or is missed entirely. Mitigation: combine with the fixed-window floor to bound the worst case, and decay the running mean toward the most recent turns to make the threshold more reactive to sustained shifts.

3 — Prediction-error boundary. Maintain a one-line running summary of the current episode (kept in working memory). Before each new turn, the model is implicitly predicting “what comes next in this conversation”; when the new turn diverges from that prediction above a threshold, emit a boundary. Concretely: every K turns, run a small-model prompt — “Here’s the current episode summary: <summary>. Here’s the next turn: <turn>. Does this turn continue the episode or start a new one?” The model emits continue|new|sub-episode plus an updated summary. Costs a small-model call every K turns (K=3-5 in practice). This is the closest direct port of the cognitive-neuroscience prediction-error mechanism, and it catches the gradual-drift case that pure semantic-shift misses — by the time five turns have drifted from the summary, the summary is the wrong anchor and the prediction-error flag fires explicitly.

4 — Structural turn-type marker. Tag each turn with its type (user-question, user-correction, assistant-answer, tool-call, tool-error, tool-success, system-event) and apply rules: a tool-error followed by a user-question opens a new sub-episode; a system-event (new file uploaded, new tool granted) is always a boundary; a user-correction typically modifies the current episode rather than starting a new one. Costs nothing at runtime if the turn types are emitted by the orchestration layer (which a well-built agent harness already does); costs a small-classifier call per turn otherwise. The failure mode is type-fidelity — if the turn-type tagger is wrong, the structural signal misleads the segmentor; mitigation is to treat structural markers as hints to the higher-level segmentor rather than authoritative boundaries.

5 — Agent-emitted marker. The agent itself emits explicit <episode_boundary> or <new_subtask> tokens during its own reasoning, and the segmentor reads them as authoritative. Costs essentially nothing (no separate model call; the agent emits the marker as part of its normal output) and has the highest fidelity because the agent is the closest possible observer of its own task structure. The failure mode is agent forgetting to emit — the agent gets absorbed in a sub-task and never marks the boundary; mitigation is to combine with signals 1-4 as fallbacks, so a missed agent marker is caught by the running prediction-error or semantic-shift detector. This is the pattern the Anthropic agent SDK ships in the compact tool and similar — letting the agent self-segment is the highest-quality signal when it works, but never the only signal.

The defensible production stack runs all five: a structural-marker fast path (signal 4) catches the easy cases at zero cost; an agent-emitted marker (signal 5) catches the cases the agent self-identifies; a periodic prediction-error check (signal 3) every 3-5 turns catches the gradual drifts; a continuous semantic-shift detector (signal 2) catches sharp shifts the others miss; a fixed-window timeout (signal 1) bounds the worst case. Any single signal alone has a known failure mode; the layered combination handles each other’s blind spots.

Salience scoring — the anchored 1-10 prompt

The salience score is the second output of the segmentation pass: once you have an episode, how much should it weigh on future recall? The canonical formulation, from Park et al.’s Generative Agents paper, is the 1-10 poignancy prompt with explicit endpoint anchoring. The exact text from the paper’s §A.1:

“On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing teeth, making bed) and 10 is extremely poignant (e.g., a break up, college acceptance), rate the likely poignancy of the following piece of memory. Memory: <episode>. Rating: <fill in>.”

The anchoring is the whole architecture. Without it, LLM-rated importance scores cluster at 6-8 regardless of content, the rerank’s importance term becomes near-constant, and the entire α·recency + β·importance + γ·similarity formula collapses to α·recency + γ·similarity — losing one-third of its information. With anchoring, the model has explicit endpoints to calibrate against — “is this more like brushing teeth or more like college acceptance?” — and the resulting distribution is wide enough that the rerank can actually use it. The anchoring works because the model has read enough text to know what a 1 looks like and what a 10 looks like; pinning the scale lets the rest of the range calibrate itself against the named endpoints.

Three patterns separate a useful salience scorer from a useless one.

Pattern 1 — Anchor both endpoints with concrete examples. “1 = mundane, 10 = significant” is too abstract; “1 = brushing teeth, 10 = getting divorced” pins the scale. Use endpoints that are vivid enough that the model can’t read them as the same thing. For an agent in a specific domain, replace the Generative Agents’ personal-life endpoints with domain-appropriate ones — “1 = a one-line clarifying question, 10 = a critical production incident root-cause” for an SRE agent; “1 = boilerplate getter method, 10 = the core algorithm of the system” for a coding agent. Domain-specific anchors produce sharper scores than generic ones.

Pattern 2 — Make the score deterministic. Run the scorer with temperature=0 and a tightly-constrained output schema (Literal[1, 2, ..., 10] or JSON-schema-constrained). Free-form generation will produce “7-8, depending on context” answers that the downstream pipeline has to parse; structured output forces a single integer and the rerank can rely on it.

Pattern 3 — Score at the right unit. Score the episode, not the turn. Per-turn scoring is the most common production mistake — it inflates the rerank’s bias toward whichever turns happened to score high and makes the salience term roughly equivalent to a noisy reweighting of similarity. Per-episode scoring is what the Generative Agents paper actually does, and it’s the unit at which “how much did this matter?” is a well-posed question. The segmentor emits boundaries first; the salience scorer runs once per emitted segment.

A non-obvious refinement from the MemoryBank paper (Zhong, Guo et al., AAAI 2024) and the Generative Agents follow-up work: the salience score is updated at retrieval time. Each successful retrieval of an episode increments a recall-count or resets a recency-decay term — exactly the LRU/LFU hybrid the hierarchical memory piece named. The write-time score is the starting weight; the read pattern adjusts it. An episode written with importance=3 that gets retrieved 50 times has demonstrated more salience than the original score reflects; the maintenance pass should reflect that. Conversely, an importance=8 episode that’s never retrieved is a sign the write-time score was too high or the topic became irrelevant; either way it should decay.

When the segmentation runs: streaming vs deferred

Two architectural choices, the same shape as the write policy timing question.

Streaming segmentation runs the segmentor inline with the agent loop. Every turn triggers signals 1-4 above; when any signal fires, the segmentor closes the current segment, runs the extract+salience pass over it, writes the result to the episodic store, and opens a new segment. Pros: the segment is queryable as soon as it’s closed — same-session retrieval works for content the user mentioned 10 minutes ago. Cons: the per-turn latency includes the segmentation-check cost (sub-100ms with cheap signals, but each signal adds to the tail); a mis-fired boundary creates a small noisy segment that has to be merged later.

Deferred segmentation runs the segmentor at session-end or in a background pass over the raw journal. The agent loop only writes to the journal; segmentation, extraction, and salience all run later. Pros: zero per-turn cost; the segmentor sees the whole session and produces higher-quality segments; can use larger, slower models than the foreground agent can afford. Cons: same-session retrieval works only on the in-context conversation buffer, not the segmented store; the segmented store lags the journal by the deferred-pass interval (seconds for hot-path-deferred, minutes for batch-deferred, hours for nightly).

The pattern that wins in production is streaming for the cheap signals, deferred for the expensive ones: structural markers (signal 4) and agent-emitted markers (signal 5) run inline at zero cost; the prediction-error and semantic-shift checks (signals 2-3) run inline at light cost; the full extract+salience+dedup pass runs deferred. The result is that segments are queryable within seconds of close (good enough for same-session continuity) and the expensive computation happens on the deferred path (good for cost). This is the same hot-path/deferred split Mem0’s 2026 redesign made for the write pipeline overall — the segmentation choice is the same architecture one layer up.

Code: Python — a hybrid segment-and-score pipeline

The smallest interesting build: a streaming segmentor that combines signals 1, 2, and 3 (fixed-window, semantic-shift, prediction-error) and a Generative-Agents-style salience scorer. Writes to Chroma for the episodic store. Uses the Anthropic SDK for both the prediction-error check and the salience score, Sentence-Transformers for the embedding-based semantic-shift detector. Install: pip install anthropic chromadb sentence-transformers.

python
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
import os
import json
import time
import uuid
from dataclasses import dataclass, field
from typing import Literal

import chromadb
import numpy as np
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer

client = Anthropic()
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.PersistentClient(path="./episode_store")
episodes = chroma.get_or_create_collection("episodes")

# --------- tunables ---------
MAX_SEGMENT_TURNS = 15           # signal 1 — fixed-window floor
SEMANTIC_SHIFT_THRESHOLD = 0.45  # signal 2 — cosine distance to running mean
PREDICTION_CHECK_EVERY = 4       # signal 3 — small-model boundary check cadence
SCORER_MODEL = "claude-haiku-4-5"
SUMMARIZER_MODEL = "claude-haiku-4-5"


@dataclass
class Turn:
    actor: Literal["user", "assistant", "tool"]
    text: str
    ts: float


@dataclass
class Segment:
    user: str
    turns: list[Turn] = field(default_factory=list)
    embeds: list[np.ndarray] = field(default_factory=list)
    summary: str = ""  # one-line running summary
    opened_at: float = field(default_factory=time.time)


# --------- signal 2: semantic-shift detector ---------
def semantic_shift(seg: Segment, turn_embed: np.ndarray) -> float:
    """Cosine distance of new turn to running mean of segment."""
    if not seg.embeds:
        return 0.0
    running = np.mean(seg.embeds[-5:], axis=0)
    cos_sim = np.dot(running, turn_embed) / (
        np.linalg.norm(running) * np.linalg.norm(turn_embed) + 1e-9
    )
    return 1.0 - cos_sim


# --------- signal 3: prediction-error check via small model ---------
def prediction_error_check(seg: Segment, new_turn: Turn) -> tuple[bool, str]:
    """Returns (is_boundary, updated_summary)."""
    if not seg.summary:
        # First check: build the running summary, no boundary yet.
        return False, _summarize_segment(seg)

    prompt = (
        f"Current episode summary: {seg.summary}\n\n"
        f"New turn ({new_turn.actor}): {new_turn.text}\n\n"
        "Does this turn continue the current episode or start a new one?\n"
        'Return JSON: {"decision":"continue"|"new","updated_summary":"<one line>"}'
    )
    resp = client.messages.create(
        model=SCORER_MODEL,
        max_tokens=200,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    raw = resp.content[0].text.strip()
    try:
        parsed = json.loads(raw[raw.index("{") : raw.rindex("}") + 1])
    except (ValueError, json.JSONDecodeError):
        return False, seg.summary
    return parsed["decision"] == "new", parsed["updated_summary"]


def _summarize_segment(seg: Segment) -> str:
    if not seg.turns:
        return ""
    transcript = "\n".join(f"{t.actor}: {t.text}" for t in seg.turns[-6:])
    resp = client.messages.create(
        model=SUMMARIZER_MODEL,
        max_tokens=80,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": (
                    "Summarize the topic of this dialogue in one short line "
                    f"(<=15 words):\n{transcript}"
                ),
            }
        ],
    )
    return resp.content[0].text.strip()


# --------- salience scorer (anchored 1-10) ---------
def score_salience(seg: Segment, domain: str = "general") -> int:
    transcript = "\n".join(f"{t.actor}: {t.text}" for t in seg.turns)
    if domain == "coding":
        anchors = (
            "1 = trivial syntactic clarification or boilerplate; "
            "10 = the core algorithm of the system or a critical production fix"
        )
    else:
        anchors = (
            "1 = purely mundane (brushing teeth, making bed); "
            "10 = extremely poignant (a break up, college acceptance)"
        )
    prompt = (
        f"On the scale of 1 to 10, where {anchors}, rate the likely poignancy "
        f"of the following episode for future recall.\n\nEpisode:\n{transcript}\n\n"
        'Return JSON: {"score": <integer 1-10>}'
    )
    resp = client.messages.create(
        model=SCORER_MODEL,
        max_tokens=80,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    raw = resp.content[0].text.strip()
    try:
        return int(json.loads(raw[raw.index("{") : raw.rindex("}") + 1])["score"])
    except (ValueError, json.JSONDecodeError, KeyError):
        return 5


# --------- close-segment and persist ---------
def close_and_persist(seg: Segment, domain: str = "general") -> str:
    if not seg.turns:
        return ""
    salience = score_salience(seg, domain)
    transcript = "\n".join(f"{t.actor}: {t.text}" for t in seg.turns)
    seg_embed = embed_model.encode(seg.summary or transcript[:500]).tolist()
    seg_id = str(uuid.uuid4())
    episodes.add(
        ids=[seg_id],
        embeddings=[seg_embed],
        documents=[transcript],
        metadatas=[
            {
                "user": seg.user,
                "summary": seg.summary,
                "salience": salience / 10.0,
                "opened_at": seg.opened_at,
                "closed_at": time.time(),
                "n_turns": len(seg.turns),
                "last_read": time.time(),
            }
        ],
    )
    return seg_id


# --------- streaming segmentor ---------
class Segmentor:
    def __init__(self, user: str, domain: str = "general") -> None:
        self.user = user
        self.domain = domain
        self.current = Segment(user=user)
        self.turn_count = 0

    def observe(self, turn: Turn) -> str | None:
        """Process one turn. Returns segment_id if a boundary closed a segment."""
        self.current.turns.append(turn)
        emb = embed_model.encode(turn.text)
        self.current.embeds.append(emb)
        self.turn_count += 1

        boundary = False
        # signal 1 — fixed-window floor
        if len(self.current.turns) >= MAX_SEGMENT_TURNS:
            boundary = True
        # signal 2 — semantic shift
        elif semantic_shift(self.current, emb) > SEMANTIC_SHIFT_THRESHOLD:
            boundary = True
        # signal 3 — prediction-error check (cheaper cadence)
        elif self.turn_count % PREDICTION_CHECK_EVERY == 0:
            is_new, updated_summary = prediction_error_check(self.current, turn)
            self.current.summary = updated_summary
            if is_new:
                boundary = True

        if not boundary:
            return None

        # The current turn opens the *new* segment; close the prior one.
        closing = self.current
        closing.turns = closing.turns[:-1]
        closing.embeds = closing.embeds[:-1]
        seg_id = close_and_persist(closing, self.domain)
        self.current = Segment(user=self.user)
        self.current.turns.append(turn)
        self.current.embeds.append(emb)
        return seg_id

    def flush(self) -> str | None:
        """Close any open segment (e.g., at session-end)."""
        if not self.current.turns:
            return None
        seg_id = close_and_persist(self.current, self.domain)
        self.current = Segment(user=self.user)
        return seg_id


# --------- minimal usage ---------
if __name__ == "__main__":
    seg = Segmentor(user="alice", domain="coding")
    convo = [
        ("user", "I'm getting a 401 on the auth callback. Stack trace?"),
        ("assistant", "Looks like the JWT signing key got rotated."),
        ("user", "Where do I check that?"),
        ("assistant", "config/auth.yml, look for jwt_secret_rotation."),
        ("user", "Yep — the rotation timestamp is from yesterday."),
        # Sharp topic pivot here
        ("user", "Different question — how do I bump the prod deploy version?"),
        ("assistant", "Edit infra/prod/version.tf and run terraform apply."),
        ("user", "Got it, applying now."),
    ]
    for actor, text in convo:
        sid = seg.observe(Turn(actor=actor, text=text, ts=time.time()))
        if sid:
            print(f"Closed segment: {sid}")
    final = seg.flush()
    if final:
        print(f"Final segment: {final}")

Three things to notice. First, the segmentor is stateful per user — segments belong to specific users and don’t cross tenants; the user field on each segment is what makes the read-side retrieval safe to filter on. Second, the prediction-error check runs every K turns, not every turn — the model call is the expensive signal, and the cheap signals (fixed-window, semantic-shift) carry the load on the in-between turns. The cadence is a tunable; K=4 is a defensible starting point for a small-model check. Third, the salience score is normalized to [0, 1] before storage — the downstream rerank from the long-term memory piece expects normalized terms, and storing the raw 1-10 integer just defers the normalization to read time. Normalize at write; the cost is identical and the read path stays simpler.

Code: TypeScript — same shape, Vercel-AI-SDK flavor

The TypeScript implementation against Chroma’s JS client, the Anthropic SDK, and a hand-rolled cosine-similarity helper. Install: pnpm add chromadb @anthropic-ai/sdk @xenova/transformers.

typescript
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
import Anthropic from "@anthropic-ai/sdk";
import { ChromaClient } from "chromadb";
import { pipeline } from "@xenova/transformers";

const client = new Anthropic();
const chroma = new ChromaClient({ path: "http://localhost:8000" });

const MAX_SEGMENT_TURNS = 15;
const SEMANTIC_SHIFT_THRESHOLD = 0.45;
const PREDICTION_CHECK_EVERY = 4;
const SCORER = "claude-haiku-4-5";

type Actor = "user" | "assistant" | "tool";
type Turn = { actor: Actor; text: string; ts: number };
type Segment = {
  user: string;
  turns: Turn[];
  embeds: Float32Array[];
  summary: string;
  openedAt: number;
};

let extractor: any;
async function embed(text: string): Promise<Float32Array> {
  if (!extractor) {
    extractor = await pipeline(
      "feature-extraction",
      "Xenova/all-MiniLM-L6-v2"
    );
  }
  const out = await extractor(text, { pooling: "mean", normalize: true });
  return out.data as Float32Array;
}

function cosineDistance(a: Float32Array, b: Float32Array): number {
  let dot = 0,
    na = 0,
    nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9);
}

function runningMean(embeds: Float32Array[]): Float32Array {
  if (embeds.length === 0) return new Float32Array();
  const dim = embeds[0].length;
  const m = new Float32Array(dim);
  const recent = embeds.slice(-5);
  for (const e of recent) for (let i = 0; i < dim; i++) m[i] += e[i] / recent.length;
  return m;
}

async function summarizeSegment(seg: Segment): Promise<string> {
  const transcript = seg.turns
    .slice(-6)
    .map((t) => `${t.actor}: ${t.text}`)
    .join("\n");
  const resp = await client.messages.create({
    model: SCORER,
    max_tokens: 80,
    temperature: 0,
    messages: [
      {
        role: "user",
        content: `Summarize the topic of this dialogue in one short line (<=15 words):\n${transcript}`,
      },
    ],
  });
  const block = resp.content[0];
  return block.type === "text" ? block.text.trim() : "";
}

async function predictionErrorCheck(
  seg: Segment,
  turn: Turn
): Promise<{ isNew: boolean; updatedSummary: string }> {
  if (!seg.summary) {
    return { isNew: false, updatedSummary: await summarizeSegment(seg) };
  }
  const prompt =
    `Current episode summary: ${seg.summary}\n\n` +
    `New turn (${turn.actor}): ${turn.text}\n\n` +
    `Does this turn continue the current episode or start a new one?\n` +
    `Return JSON: {"decision":"continue"|"new","updated_summary":"<one line>"}`;
  const resp = await client.messages.create({
    model: SCORER,
    max_tokens: 200,
    temperature: 0,
    messages: [{ role: "user", content: prompt }],
  });
  const block = resp.content[0];
  const raw = block.type === "text" ? block.text.trim() : "";
  try {
    const j = JSON.parse(raw.slice(raw.indexOf("{"), raw.lastIndexOf("}") + 1));
    return { isNew: j.decision === "new", updatedSummary: j.updated_summary };
  } catch {
    return { isNew: false, updatedSummary: seg.summary };
  }
}

async function scoreSalience(seg: Segment, domain = "general"): Promise<number> {
  const transcript = seg.turns.map((t) => `${t.actor}: ${t.text}`).join("\n");
  const anchors =
    domain === "coding"
      ? "1 = trivial syntactic clarification; 10 = the core algorithm or a critical fix"
      : "1 = mundane (brushing teeth); 10 = extremely poignant (a break up)";
  const prompt =
    `On the scale of 1 to 10, where ${anchors}, rate the likely poignancy ` +
    `of the following episode for future recall.\n\nEpisode:\n${transcript}\n\n` +
    `Return JSON: {"score": <integer 1-10>}`;
  const resp = await client.messages.create({
    model: SCORER,
    max_tokens: 80,
    temperature: 0,
    messages: [{ role: "user", content: prompt }],
  });
  const block = resp.content[0];
  const raw = block.type === "text" ? block.text.trim() : "";
  try {
    return JSON.parse(raw.slice(raw.indexOf("{"), raw.lastIndexOf("}") + 1)).score;
  } catch {
    return 5;
  }
}

export async function buildSegmentor(user: string, domain = "general") {
  const collection = await chroma.getOrCreateCollection({ name: "episodes" });
  let current: Segment = {
    user,
    turns: [],
    embeds: [],
    summary: "",
    openedAt: Date.now(),
  };
  let turnCount = 0;

  async function close(): Promise<string | null> {
    if (current.turns.length === 0) return null;
    const salience = await scoreSalience(current, domain);
    const transcript = current.turns
      .map((t) => `${t.actor}: ${t.text}`)
      .join("\n");
    const segEmbed = Array.from(
      await embed(current.summary || transcript.slice(0, 500))
    );
    const id = crypto.randomUUID();
    await collection.add({
      ids: [id],
      embeddings: [segEmbed],
      documents: [transcript],
      metadatas: [
        {
          user,
          summary: current.summary,
          salience: salience / 10,
          opened_at: current.openedAt,
          closed_at: Date.now(),
          n_turns: current.turns.length,
          last_read: Date.now(),
        },
      ],
    });
    return id;
  }

  return {
    async observe(turn: Turn): Promise<string | null> {
      current.turns.push(turn);
      const emb = await embed(turn.text);
      current.embeds.push(emb);
      turnCount++;

      let boundary = false;
      if (current.turns.length >= MAX_SEGMENT_TURNS) {
        boundary = true;
      } else if (
        cosineDistance(runningMean(current.embeds.slice(0, -1)), emb) >
        SEMANTIC_SHIFT_THRESHOLD
      ) {
        boundary = true;
      } else if (turnCount % PREDICTION_CHECK_EVERY === 0) {
        const { isNew, updatedSummary } = await predictionErrorCheck(
          current,
          turn
        );
        current.summary = updatedSummary;
        if (isNew) boundary = true;
      }
      if (!boundary) return null;

      const closing: Segment = {
        ...current,
        turns: current.turns.slice(0, -1),
        embeds: current.embeds.slice(0, -1),
      };
      const prev = current;
      current = {
        user,
        turns: [turn],
        embeds: [emb],
        summary: "",
        openedAt: turn.ts,
      };
      // Re-bind the closing segment for the persist call.
      const segId = await (async () => {
        const tmp = current;
        current = closing;
        const id = await close();
        current = tmp;
        return id;
      })();
      return segId;
    },
    async flush(): Promise<string | null> {
      const id = await close();
      current = {
        user,
        turns: [],
        embeds: [],
        summary: "",
        openedAt: Date.now(),
      };
      return id;
    },
  };
}

The TypeScript port has the same architecture: cheap signals (fixed-window, semantic-shift) inline at every turn; the prediction-error model call gated to every K turns; the salience score run once per closed segment with explicit anchoring. The one wrinkle worth flagging is the embedding source — @xenova/transformers runs MiniLM locally in WASM, which keeps the per-turn cost at zero and the latency at sub-millisecond once the model is loaded; the alternative is to call a hosted embedding endpoint per turn, which adds 50-200ms of network and a fraction of a cent per call. For a streaming segmentor where every turn pays the embedding cost, local inference is almost always the right answer.

Trade-offs, failure modes, and gotchas

The mid-task split. A long sub-task that exceeds MAX_SEGMENT_TURNS is cut in the middle, and the second half loses the context of the first half. The retrieved second-half segment is incomplete; the rerank gives it a lower similarity hit; the user’s “what did we decide?” question lands the wrong segment. Mitigation: increase MAX_SEGMENT_TURNS (20-25 is usually safe for technical conversations), or — better — emit a <continuation_of:prev_segment_id> link when a fixed-window boundary fires, so the retrieval pass can stitch related segments back together. The continuation link is a write-time hint that the read path can follow when it lands a fragment.

The gradual-drift miss. A conversation that transitions slowly from one topic to another never crosses the SEMANTIC_SHIFT_THRESHOLD on any single turn, and the boundary lands either nowhere (one giant blurred segment) or at the wrong place (a cosine outlier turn that happens to score above threshold but isn’t actually a topic boundary). Mitigation: combine the per-turn shift with the per-K-turn prediction-error check — the prediction-error signal compares against the running summary of the segment, not the running mean of embeddings, so it catches semantic drift that the cosine signal misses.

The salience-collapse failure. Without anchored endpoints, the salience prompt returns 6-7-8 for almost every episode. The rerank’s importance term has near-zero variance; the formula collapses to recency + similarity. The single best diagnostic is to plot the salience distribution after the first 1000 episodes — if it’s a tight bell around 7, the anchoring is failing. The fix is sharper, more vivid endpoint examples in the prompt — and almost always, domain-specific anchors rather than the generic personal-life ones from Generative Agents.

The per-turn salience anti-pattern. Scoring each turn instead of each segment is the most common production mistake. Per-turn salience is meaningless (“how poignant is 'okay'?” is not a well-posed question) and inflates compute cost by 10-20x. Always score at the segment level; if you need turn-level information, store it as metadata on the segment, not as a separate score.

The agent-marker-only failure. Trusting only signal 5 (agent-emitted boundaries) sounds elegant but fails when the agent forgets to emit. In practice, agents reliably emit boundary markers ~40-60% of the time without explicit prompting; the missing markers are exactly the cases where the segmentor is most needed. Always layer signals 1-4 underneath as fallbacks.

The deferred-segmentation lag bug. A session-end-only segmentor leaves same-session recall broken — the user references something said 5 minutes ago, and the only place it’s retrievable is the in-context conversation buffer. If the buffer has already evicted that turn (because of short-term memory truncation), the recall fails entirely. The journal-and-checkpoint pattern from the memory-write-policies piece is the cleanest fix: stream the raw journal so it’s always queryable, defer only the segmentation+salience pass.

The salience monotonicity bug. A naive maintenance pass that only decays salience over time will eventually demote every episode to forgettable. The cleaner pattern is the MemoryBank-style read-driven boost: each successful retrieval increments the salience (S += δ), so episodes that keep proving useful keep their weight while unused ones decay. Without the read-driven boost, the salience score is a write-time estimate that ages out of relevance; with it, the score becomes an adaptive measure of actual importance to the read pattern.

The boundary-thrash failure. A noisy semantic-shift threshold can fire on every other turn, producing dozens of tiny single-turn segments. Each tiny segment is below the useful-recall floor (one isolated turn is rarely meaningful out of context); the store fills with noise. Mitigation: enforce a MIN_SEGMENT_TURNS (typically 3-5) so the segmentor can’t close a segment with too few turns — when a signal fires before the minimum, log it as a soft hint on the current segment instead of closing.

The salience cost trap. Running the salience scorer at high quality (large model, temperature=0, long prompt) on every segment costs ~$0.001 per call at small-model rates. For 1000 segments per active user per day that’s $1/day/user in salience-scoring alone. The fix is the same hot-path/deferred split the write policy uses: run a cheap heuristic salience at write time (a regex+length heuristic giving rough 1-10), and run the high-quality LLM salience in a background pass that overrides the heuristic. The deferred path picks up bandwidth from the hot path; the hot path gets a useful-enough score immediately.

Further reading

  • Memory Write Policies: What’s Worth Remembering — the admission-control layer that consumes the segments and salience scores this article produces. Segmentation decides the unit; write policy decides whether the unit enters the store at all.
  • Long-Term Memory: Vector-Backed Episodic Storage — the substrate that holds the segmented episodes and runs the recency × importance × similarity rerank that uses the salience term. The two articles are companions: this one is the upstream operation, that one is the downstream substrate.
  • Memory Retrieval Policies: Recency, Relevance, Importance — the direct read-side consumer of the salience score this article produces. The α·recency + β·importance + γ·similarity rerank uses the anchored 1-10 score as its importance term; without anchored salience, the rerank collapses to recency-plus-similarity.
  • Reflection: From Experiences to Beliefs — the downstream consumer of the salience score this article anchors. Reflection fires on accumulated importance, pulls a window of high-salience episodes, and emits citation-anchored higher-order beliefs that the read path can use in place of re-reasoning from raw turns.