$ cat ai-engineering/long-horizon-reliability.md

Long-Horizon Task Reliability

Drift, checkpointing, and recovery in long-running agents: the distributed-saga parallel, when to abort, and the METR doubling curve.

Jatin Bansal@blog:~/ai-engineering$ open long-horizon-reliability

A code-migration agent has been running for four hours. It has converted 312 of 400 stored procedures from Oracle to Postgres, written 287 passing tests, and committed each as its own PR. Then it picks up the next file, decides the two failures it saw earlier in the run mean the test runner is broken, edits the test harness to “fix” it, breaks every test in the repo, and spends the next 40 minutes trying to repair the damage. Nobody is watching at 2am. The on-call wakes up to 312 good PRs, 88 untouched files, and a corrupted test harness that takes three hours to back out. The model didn’t get worse over those four hours. The loop’s reliability got worse, on a curve that capability benchmarks don’t measure. This is what long-horizon failure looks like in production, and it’s the wall every team that ships agents eventually hits.

Opening bridge

Yesterday’s piece on computer use closed the Agents subtree’s tour of action surfaces. The agent loop and planning-vs-reactive articles framed control flow inside a single run; tool-selection and multi-agent stretched it sideways. Today we stretch it forward in time: what happens when the loop runs for hours, not seconds, and when the budget cap from the agent-loop article is the floor rather than the ceiling. That’s the long-horizon reliability problem, and it sits one rung above the loop in the stack.

Capability vs reliability

The cleanest framing is the reliability-vs-capability split from the late-2026 “Beyond pass@1” framework: capability is whether the model can do a task on its best attempt (pass@1, the standard benchmark axis); reliability is whether it can do the task consistently across attempts of varying duration. A model that scores 90% on a 5-minute task and 44% on a 4-hour version isn’t a worse model — it’s the same model in a different failure regime.

Two headline findings. Frontier models exhibit the highest meltdown rates (DeepSeek V3 at 19%, MiniMax M2.5 at 13% at very-long horizons) precisely because they attempt ambitious multi-step strategies; weaker models hover around 0-4% because they fall back on rote, shallow strategies that don’t have anywhere to drift to. And the failure shape is positive error correlation across steps: once the agent forms a wrong hypothesis it persists rather than recovers, and the reliability curve decays super-linearly relative to an i.i.d. Bernoulli baseline. Capability is the model’s job; reliability is the harness’s job.

The METR doubling curve

The empirical anchor is METR’s time-horizon benchmark, which measures the task duration (calibrated against human completion time) that an agent can do with 50% reliability. As of METR Time Horizon 1.1 in January 2026, Claude Opus 4.5 sits at 320 minutes (5.3 hours), GPT-5 at 214, o3 at 121, Claude Opus 4 at 101. The trend since 2024 is a doubling every 89-131 days, with later updates putting Opus 4.6 at roughly 14.5 hours by February 2026. The “Moore’s law for agents” framing is real and the curve is steep.

Two things to internalize. First, the 50% point is the inflection, not the ceiling. At 80% reliability the same models cap out at a third to a half of the 50% horizon. If “correct most of the time” isn’t enough, you’re targeting an 80% or 95% horizon, and those numbers are much smaller than the headlines. Second, the curve says nothing about failure shape. A model with a 5-hour 50% horizon doesn’t degrade gracefully past 5 hours; it melts down. The horizon is a cliff, not a slope, and engineering past it means assuming you’ll fall off and designing for recovery.
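A back-of-envelope check on the first point. The sketch assumes a constant per-minute failure rate, so success probability decays exponentially with task length. That is a simplifying assumption for illustration, not METR’s fitting method, and `horizon_at` is a hypothetical helper name.

```python
import math

def horizon_at(reliability: float, h50_minutes: float) -> float:
    """Task length (minutes) completed with `reliability`, given the 50% horizon.

    Assumes P(success at length t) = 0.5 ** (t / h50), i.e. a constant
    failure rate. This is an illustrative model, not a measured curve.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    return h50_minutes * math.log(reliability) / math.log(0.5)

h50 = 320.0  # the 50% horizon quoted above, in minutes
for target in (0.5, 0.8, 0.95):
    print(f"{target:.0%} horizon: {horizon_at(target, h50):6.1f} min")
```

Under this assumption a 320-minute 50% horizon shrinks to roughly 103 minutes at 80% and roughly 24 minutes at 95%. The 80% figure is about a third of the 50% horizon, the low end of the range above. Given the cliff-shaped failure mode, the real numbers are unlikely to be kinder than this.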

The distributed-systems parallel: the saga

The clean analogue is the distributed saga pattern. A saga is a long-running transaction split into local steps, each with a defined compensating action that undoes its effect. If step 7 of 10 fails, the saga executes compensations for steps 1-6 in reverse order rather than holding a global lock for hours.
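A minimal sketch of that shape, assuming each step is a pair of plain callables. `SagaStep` and `run_saga` are illustrative names, not an API from any saga library.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]       # the local transaction
    compensate: Callable[[], None]   # undoes the action's effect

def run_saga(steps: list[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: list[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step never committed, so only earlier steps are undone.
            for finished in reversed(done):
                finished.compensate()
            return False
        done.append(step)
    return True

log: list[str] = []

def fail() -> None:
    raise RuntimeError("step c failed")

steps = [
    SagaStep("a", lambda: log.append("do a"), lambda: log.append("undo a")),
    SagaStep("b", lambda: log.append("do b"), lambda: log.append("undo b")),
    SagaStep("c", fail, lambda: log.append("undo c")),
]
ok = run_saga(steps)
print(ok, log)
```

The sketch leaves out what a production orchestrator cannot: compensations can fail too, so they need their own retries and a durable record of which ones have run.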

Each tool call with a side effect is a saga step. The agent writes a PR, sends an email, charges a card — each is a local transaction against an external system; the whole task is the agent’s “transaction.” Two-phase commit doesn’t apply (the agent can’t hold a lock across services for four hours), so the agent looks exactly like a saga: a long chain of local commits, each independently durable, with no global rollback.

Compensations are the agent’s safety net. When step 312 fails and the right answer is “back out the last 5 steps and stop,” the agent needs typed compensations for every state-changing tool — unsend_email, delete_pr, refund_charge. Most production agents don’t have these because tools were designed for single-shot use. Building the compensation surface is half the work of shipping a reliable long-horizon agent. The corollary: idempotency keys are mandatory on every mutating tool, the same way every saga step gets one. A retried charge_card after a checkpoint restart needs to be a no-op, not a second charge.
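The idempotency half can be sketched as a receipt store in front of the tool. Everything here is an assumption for illustration: `IdempotentTool` is a made-up wrapper, the receipts live in memory where a real harness would use durable storage, and the key is derived from the run id, tool name, and arguments.

```python
import hashlib
import json
from typing import Callable

class IdempotentTool:
    """Wraps a mutating call so a retry with the same key is a no-op."""

    def __init__(self, fn: Callable[..., dict]):
        self._fn = fn
        self._receipts: dict[str, dict] = {}  # in-memory stand-in for a durable store

    @staticmethod
    def key_for(run_id: str, tool: str, args: dict) -> str:
        payload = json.dumps({"run": run_id, "tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, run_id: str, tool: str, args: dict) -> dict:
        key = self.key_for(run_id, tool, args)
        if key in self._receipts:
            return self._receipts[key]   # replay the stored result, no second side effect
        result = self._fn(**args)
        self._receipts[key] = result
        return result

charges: list[int] = []

def charge_card(amount_cents: int) -> dict:
    charges.append(amount_cents)         # stands in for the real side effect
    return {"ok": True, "charged": amount_cents}

tool = IdempotentTool(charge_card)
first = tool.call("run-1", "charge_card", {"amount_cents": 500})
retry = tool.call("run-1", "charge_card", {"amount_cents": 500})  # after a restart
print(charges, first == retry)
```

Two caveats on the design. Keying on arguments alone means two legitimately identical charges in one run would collapse into one, so a real key usually includes a step or intent identifier. And a crash between the side effect and the receipt write still double-executes, which is why the key should also be passed to the downstream service when it supports one.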

Drift is the saga’s “lost-update” problem. In a long saga, the orchestrator’s state and the services’ states drift between checkpoints. The same shape shows up in long agent runs: the agent’s internal model of the world (what’s in the conversation history) diverges from the actual state (what’s in the database, the codebase, the ticket). At turn 200 the agent thinks staging is deployed because the tool returned success at turn 47; by turn 200, staging has crashed and someone else has restarted it. The agent’s plan, anchored to stale beliefs, executes against a world that’s moved on.
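One hedged way to picture the mitigation is to stamp each belief with the turn that established it and re-observe before acting on anything old. `Belief`, `fresh_value`, and the age limit are hypothetical; the right staleness window depends on how fast the environment changes.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Belief:
    claim: str           # e.g. "staging is deployed"
    value: object
    observed_turn: int   # turn at which a tool result established it

def fresh_value(belief: Belief, current_turn: int, max_age_turns: int,
                reobserve: Callable[[], object]) -> Belief:
    """Return the belief unchanged if recent, else re-observe the world."""
    if current_turn - belief.observed_turn <= max_age_turns:
        return belief
    return Belief(belief.claim, reobserve(), current_turn)

belief = Belief("staging is deployed", True, observed_turn=47)
checked = fresh_value(belief, current_turn=200, max_age_turns=20,
                      reobserve=lambda: False)  # stand-in for a real health check
print(checked)
```

At turn 200 the turn-47 observation is 153 turns old, so the harness asks the world again instead of trusting the history.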

The saga parallel isn’t a metaphor — it’s the same problem with the same mitigations: typed compensations, idempotency keys, explicit state checkpoints, and an orchestrator that owns recoverability.

The mechanisms of long-horizon drift

Four mechanisms drive the super-linear decay, and they compound. The “Beyond pass@1” framework and the Wire blog’s anatomy of agent drift converge on this list:

Context drift. History grows; older turns become decorations the model attends to less; load-bearing details from turn 10 are diluted by 200 turns of subsequent noise. The lost-in-the-middle failure from the context-engineering article at a longer time scale. Mitigation: aggressive compaction and JIT-only context fetches.
Hallucination cascades. A wrong fact at turn 15 gets cited at turn 30, becomes part of the agent’s “known” state, and by turn 100 the agent is reasoning from a corrupted premise it can’t distinguish from primary observation. Mitigation: explicit provenance tracking — every working-state claim carries a pointer to its source turn, the same shape as the memory-provenance article describes.
Goal drift. “Migrate 400 stored procedures” gets gradually re-interpreted as “improve the test harness” or “refactor the migration script” — adjacent activities that look like progress but aren’t the task. Mitigation: an explicit goal artifact (a typed object, not a paragraph) re-injected at every step.
Meltdown. The pathological end state — the agent transitions from “coherent but wrong” to “incoherent looping, self-contradiction, hallucinated tool outputs.” MOP (meltdown-of-progress) precursors are entropy spikes, repeated tool calls with slightly varied args, contradictions across consecutive turns. Mitigation: detect early, save state and restart with a fresh context window — not just compressing, but saying “this run is poisoned, resume from a checkpoint that wasn’t.”

Context drift makes cascades easier to seed; cascades make goal drift undetectable from inside the loop; goal drift sets up meltdown when the wrong action contradicts the world hard enough.
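One of the meltdown precursors above, repeated tool calls with slightly varied arguments, is cheap to detect outside the model. The sketch below uses string similarity over a canonical form of each call. The class name, the window of 5, and the 0.9 cutoff are all assumptions to tune, not values from any of the cited sources.

```python
import json
from collections import deque
from difflib import SequenceMatcher

class NearDuplicateDetector:
    """Flags a tool call that closely resembles one of the last few calls."""

    def __init__(self, window: int = 5, threshold: float = 0.9):
        self.recent: deque = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, tool: str, args: dict) -> bool:
        """Record a call; return True if it is a near-duplicate of a recent one."""
        canon = f"{tool}:{json.dumps(args, sort_keys=True)}"
        hit = any(
            SequenceMatcher(None, canon, prev).ratio() >= self.threshold
            for prev in self.recent
        )
        self.recent.append(canon)
        return hit

detector = NearDuplicateDetector()
first = detector.observe("run_tests", {"path": "tests/test_a.py"})
second = detector.observe("run_tests", {"path": "./tests/test_a.py"})
print(first, second)
```

An exact-hash check would miss the second call because the path differs by two characters. The similarity check catches it, at the cost of more false positives on tools whose legitimate calls look alike.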

Checkpointing: between-node vs durable

Once you accept the run will fail somewhere past the 50% horizon, the engineering question is how cheaply you can resume. Production frameworks disagree on what “checkpoint” means.

Step-level (between-node) checkpointing. LangGraph’s persistence layer is the canonical example: a checkpointer saves graph state between nodes. On crash, the next run reads the most recent checkpoint and resumes from the next node. Cheap, human-debuggable, the right primitive for most agents.

The gotcha is what it doesn’t checkpoint. State inside a node is not saved. If one node is doing 200 iterations of work and crashes at iteration 47, the next run restarts that node at iteration 0. Diagrid’s critique of checkpoint-based frameworks is the pointed take. The cost shows up as wasted tokens — 47 API calls re-issued, 47 mutating writes re-attempted (why the idempotency keys above are mandatory) — and as latency on recovery.
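One way to shrink that loss, sketched under the assumption that the node’s work is an ordered list of items, is to persist a small progress cursor from inside the node. `process_batch`, the cursor file, and the `handle` callback are hypothetical names for illustration and are not part of LangGraph.

```python
import json
from pathlib import Path
from typing import Callable

def process_batch(items: list[str], cursor_path: Path,
                  handle: Callable[[str], None]) -> int:
    """Process items in order, persisting a cursor after each one.

    Returns the number of items handled in this invocation. A restart
    picks up at the first item whose cursor write never landed.
    """
    start = 0
    if cursor_path.exists():
        start = json.loads(cursor_path.read_text())["next"]
    handled = 0
    for i in range(start, len(items)):
        # handle() must be idempotent: a crash can hit after the side
        # effect but before the cursor write below.
        handle(items[i])
        cursor_path.write_text(json.dumps({"next": i + 1}))
        handled += 1
    return handled
```

If the process dies while handling the third of four items, the next call with the same cursor file handles only the last two. The plain `write_text` is not atomic; a real version would write to a temporary file and rename it.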

Durable-execution checkpointing. Temporal, Restate, and similar frameworks invert the model. The engine records every I/O operation into a durable event log. On crash, the workflow replays from the start, but every recorded I/O returns its previously-cached result instead of re-issuing. The agent never knows it crashed. Trade-off: more infrastructure and a constrained programming model (every side effect wrapped as an “activity”) in exchange for dramatically stronger recovery semantics.
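The replay idea fits in a toy event log. This is a sketch of the concept only, not how Temporal or Restate are implemented; `ReplayLog` and `workflow` are made-up names, and the log lives in memory where a real engine would persist it.

```python
from typing import Any, Callable, Optional

class ReplayLog:
    """Toy event log: each I/O call is recorded once and replayed thereafter."""

    def __init__(self, events: Optional[list] = None):
        self.events: list = list(events or [])  # would live in durable storage
        self._cursor = 0

    def io(self, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self.events):
            result = self.events[self._cursor]  # replay: do not re-issue the call
        else:
            result = fn()                       # first execution: run and record
            self.events.append(result)
        self._cursor += 1
        return result

def workflow(log: ReplayLog, fetch: Callable[[], int],
             write: Callable[[int], str]) -> str:
    # Deterministic code between I/O calls is what makes replay line up.
    value = log.io(fetch)
    return log.io(lambda: write(value * 2))
```

If the first run crashes inside `write`, its log holds only the `fetch` result. Constructing a new `ReplayLog` from those events and re-running `workflow` returns the recorded value without calling `fetch` again, then executes `write` for real. The constraint is visible too: the workflow must make the same I/O calls in the same order on every replay.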

The 2026 production hybrid: LangGraph for control flow with Postgres checkpointing, Temporal (or equivalent) wrapping the outer harness when nodes can take minutes or contain many side effects. The AWS DynamoDB write-up and the LangGraph-vs-Temporal architecture pieces make this case directly. A third primitive worth flagging: event-sourcing the conversation itself — persist every turn with a monotonic sequence number and you can rebuild state at any point by replaying turns up to that number, the continuation-passing-style frame from the agent-loop article made durable.
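The event-sourcing primitive is small enough to sketch directly. The class and field names below are assumptions, and the list stands in for an append-only table.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Turn:
    seq: int       # monotonic sequence number, assigned on append
    role: str
    content: str

@dataclass
class ConversationLog:
    turns: list = field(default_factory=list)  # append-only in a real store

    def append(self, role: str, content: str) -> Turn:
        turn = Turn(seq=len(self.turns) + 1, role=role, content=content)
        self.turns.append(turn)
        return turn

    def replay(self, upto_seq: int) -> list:
        """Rebuild the message list as it stood after turn `upto_seq`."""
        return [{"role": t.role, "content": t.content}
                for t in self.turns if t.seq <= upto_seq]

convo = ConversationLog()
convo.append("user", "Migrate 400 stored procedures")
convo.append("assistant", "Starting with proc_001")
convo.append("user", "tool_result: ok")
print(convo.replay(2))
```

Rolling back to a pre-drift point is then just a replay to an earlier sequence number, with nothing deleted from the log.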

Code: a checkpointed agent loop in Python with LangGraph

A migration-assistant pattern with explicit drift detection and recovery. The loop checkpoints to SQLite between steps, tracks an explicit goal, and aborts on meltdown precursors. Install: pip install langgraph langgraph-checkpoint-sqlite anthropic. Uses LangGraph and the Anthropic SDK.

python

import json, hashlib
from typing import TypedDict, Annotated
from operator import add
from anthropic import Anthropic
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.sqlite import SqliteSaver

client = Anthropic()

class AgentState(TypedDict):
    goal: str
    messages: Annotated[list, add]
    step: int
    tool_sigs: Annotated[list, add]
    drift_score: float
    completed_ids: list[str]
    aborted: bool

MELTDOWN_THRESHOLD = 0.7
MAX_STEPS = 500

TOOLS = [
    {"name": "migrate_proc", "description": "Migrate one stored procedure. Idempotent.",
     "input_schema": {"type": "object", "properties": {"path": {"type": "string"}},
                      "required": ["path"]}},
    {"name": "run_tests", "description": "Run the test suite. Read-only.",
     "input_schema": {"type": "object", "properties": {"path": {"type": "string"}},
                      "required": ["path"]}},
    {"name": "submit_final", "description": "Submit final results. Terminates.",
     "input_schema": {"type": "object",
                      "properties": {"summary": {"type": "string"},
                                     "completed": {"type": "integer"}},
                      "required": ["summary", "completed"]}},
]


def call_model(state: AgentState) -> dict:
    # Re-inject the goal at the top of every call. Cheapest mitigation
    # for goal drift — the model sees the original task on every turn,
    # not just at turn 0.
    sys = (
        f"GOAL (immutable): {state['goal']}\n\n"
        f"Step {state['step']} of at most {MAX_STEPS}. Before each tool call, "
        f"confirm it serves the goal. If finished, call submit_final."
    )
    resp = client.messages.create(
        model="claude-opus-4-7", max_tokens=2048,
        system=sys, tools=TOOLS, messages=state["messages"],
    )
    return {"messages": [{"role": "assistant", "content": resp.content}]}


def dispatch_tools(state: AgentState) -> dict:
    last = state["messages"][-1]["content"]
    tool_blocks = [b for b in last if b.type == "tool_use"]
    if not tool_blocks:
        return {"step": state["step"] + 1}

    results, sigs, drift_delta = [], [], 0.0
    new_completed = list(state["completed_ids"])
    for b in tool_blocks:
        sig = hashlib.sha1(
            f"{b.name}:{json.dumps(b.input, sort_keys=True)}".encode()
        ).hexdigest()
        sigs.append(sig)
        # No-progress detection: same call in last 5 → drift signal.
        if sig in state["tool_sigs"][-5:]:
            drift_delta += 0.3

        if b.name == "submit_final":
            return {"messages": [{"role": "user", "content": [
                {"type": "tool_result", "tool_use_id": b.id, "content": "ok"}]}],
                "step": state["step"] + 1}

        # Idempotency receipt for mutating tools — refuse re-execution.
        if b.name == "migrate_proc" and sig in new_completed:
            results.append({"type": "tool_result", "tool_use_id": b.id,
                            "content": "skipped: already completed"})
            continue

        out = execute_tool(b.name, b.input)   # real dispatch here
        results.append({"type": "tool_result", "tool_use_id": b.id,
                        "content": json.dumps(out),
                        "is_error": not out.get("ok", True)})
        if b.name == "migrate_proc" and out.get("ok"):
            new_completed.append(sig)
        if not out.get("ok", True):
            drift_delta += 0.1

    return {"messages": [{"role": "user", "content": results}],
            "tool_sigs": sigs,
            "drift_score": state["drift_score"] + drift_delta,
            "completed_ids": new_completed,
            "step": state["step"] + 1}


def execute_tool(name: str, args: dict) -> dict:
    if name == "migrate_proc": return {"ok": True, "path": args["path"]}
    if name == "run_tests":    return {"ok": True, "passed": True}
    return {"ok": False, "error": f"unknown tool: {name}"}


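def should_continue(state: AgentState) -> str:
    last = state["messages"][-1]
    # dispatch appended no tool results, so the model made no tool calls
    # and there is nothing left to run.
    if last["role"] == "assistant":
        return END
    # submit_final is terminal: stop once the preceding assistant turn used it.
    prev_blocks = state["messages"][-2]["content"]
    if any(b.type == "tool_use" and b.name == "submit_final" for b in prev_blocks):
        return END
    if state["step"] >= MAX_STEPS:
        return END
    if state["drift_score"] >= MELTDOWN_THRESHOLD:
        # Meltdown precursor crossed. Bail; caller restarts from the
        # last good checkpoint with a fresh, compacted context.
        return END
    return "call_model"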


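def build_graph(checkpointer):
    g = StateGraph(AgentState)
    g.add_node("call_model", call_model)
    g.add_node("dispatch", dispatch_tools)
    g.add_edge(START, "call_model")
    g.add_edge("call_model", "dispatch")
    g.add_conditional_edges("dispatch", should_continue,
                            {"call_model": "call_model", END: END})
    return g.compile(checkpointer=checkpointer)


def run(goal: str, *, thread_id: str):
    initial = {"goal": goal,
               "messages": [{"role": "user", "content": goal}],
               "step": 0, "tool_sigs": [], "drift_score": 0.0,
               "completed_ids": [], "aborted": False}
    # thread_id binds this run to a checkpoint stream. LangGraph's default
    # recursion limit is far below MAX_STEPS loop iterations (two nodes
    # each), so raise it explicitly.
    config = {"configurable": {"thread_id": thread_id},
              "recursion_limit": 2 * MAX_STEPS + 1}
    # In current langgraph-checkpoint-sqlite releases, from_conn_string is a
    # context manager that yields the saver and closes the connection on exit.
    with SqliteSaver.from_conn_string("agent.sqlite") as checkpointer:
        graph = build_graph(checkpointer)
        # A pending next node means an earlier run on this thread stopped
        # mid-graph (for example, a crash). Passing None as the input resumes
        # from that checkpoint; passing `initial` again would start over.
        if graph.get_state(config).next:
            return graph.invoke(None, config=config)
        return graph.invoke(initial, config=config)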

Four things worth flagging. The goal is re-injected at every turn via the system prompt — the cheapest single mitigation for goal drift; without it, by turn 200 the goal is buried under 199 turns of noise. The drift score is a soft halt signal: when it crosses the meltdown threshold the graph routes to END, and the caller restarts from the last checkpoint with a compacted prefix — the MOP-triggered context resetting pattern. Idempotency is enforced at the dispatch layer: a repeated migrate_proc call returns “skipped” without re-executing, so the graph trusts the receipt over the model’s intent. The checkpointer is bound by thread_id: same id across invocations means “resume from the last between-node checkpoint.” Anything happening inside dispatch_tools when a crash hits is lost — for nodes that do many side effects, you want Temporal underneath, not just LangGraph.

Code: a Temporal workflow wrapping the same agent in TypeScript

When the agent’s nodes do real work (many minutes per step, many side effects), wrap the outer loop in Temporal so the whole thing is durable through process crashes, not just between checkpoints. Install: npm install @temporalio/client @temporalio/worker @temporalio/workflow @anthropic-ai/sdk. Uses Temporal TypeScript SDK and the Anthropic SDK.

typescript

import {
  proxyActivities, defineSignal, setHandler, workflowInfo,
} from "@temporalio/workflow";
import type * as activities from "./activities";

const { callModel, executeTool, persistCheckpoint } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: "5 minutes",
    retry: { maximumAttempts: 3, initialInterval: "2s" },
  });

export const abortSignal = defineSignal("abort");

const MAX_STEPS = 500;
const MELTDOWN_THRESHOLD = 0.7;

interface RunInput {
  goal: string;
  resumeFrom?: { step: number; messages: any[]; completedIds: string[] };
}

export async function migrationAgent(input: RunInput): Promise<{
  completed: number; aborted: boolean; reason: string;
}> {
  let step = input.resumeFrom?.step ?? 0;
  let messages = input.resumeFrom?.messages ?? [
    { role: "user", content: input.goal },
  ];
  const completedIds = new Set(input.resumeFrom?.completedIds ?? []);
  let driftScore = 0;
  const lastSigs: string[] = [];
  let aborted = false;

  // External signals can abort the workflow — operator dashboard,
  // SLO alert, budget breach.
  setHandler(abortSignal, () => { aborted = true; });

  while (step < MAX_STEPS && !aborted) {
    // Each activity is recorded in Temporal's event history.
    // On worker crash, replay returns the cached results up to the
    // failure point — the workflow never knows it died.
    const assistant = await callModel({ goal: input.goal, messages, step });
    messages.push({ role: "assistant", content: assistant.content });

    const toolUses = assistant.content.filter((b: any) => b.type === "tool_use");
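    if (toolUses.length === 0) {
      // The model stopped calling tools without submit_final. Report that
      // explicitly instead of falling through to the "step_cap" reason.
      return { completed: completedIds.size, aborted: false, reason: "no_tool_calls" };
    }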

    const final = toolUses.find((b: any) => b.name === "submit_final");
    if (final) {
      return { completed: final.input.completed, aborted: false, reason: "submit_final" };
    }

    const results: any[] = [];
    for (const b of toolUses) {
      const sig = `${b.name}:${JSON.stringify(b.input)}`;
      if (lastSigs.slice(-5).includes(sig)) driftScore += 0.3;
      lastSigs.push(sig);

      if (b.name === "migrate_proc" && completedIds.has(sig)) {
        results.push({ type: "tool_result", tool_use_id: b.id,
                       content: "skipped: already completed" });
        continue;
      }

      const out = await executeTool({ name: b.name, args: b.input });
      results.push({ type: "tool_result", tool_use_id: b.id,
                     content: JSON.stringify(out), is_error: !out.ok });
      if (b.name === "migrate_proc" && out.ok) completedIds.add(sig);
      if (!out.ok) driftScore += 0.1;
    }
    messages.push({ role: "user", content: results });
    step += 1;

    // Snapshot every 25 steps. Temporal's event log is the primary
    // durability layer; this is the application-level checkpoint used
    // to start a *fresh* workflow with a compacted prefix on meltdown.
    if (step % 25 === 0) {
      await persistCheckpoint({ workflowId: workflowInfo().workflowId,
                                step, messages, completedIds: [...completedIds] });
    }

    if (driftScore >= MELTDOWN_THRESHOLD) {
      await persistCheckpoint({ workflowId: workflowInfo().workflowId,
                                step, messages, completedIds: [...completedIds] });
      return { completed: completedIds.size, aborted: true, reason: "meltdown" };
    }
  }

  return { completed: completedIds.size, aborted,
           reason: aborted ? "external_signal" : "step_cap" };
}

The Temporal workflow gives you three properties LangGraph alone doesn’t. Durable I/O semantics: every activity is recorded in the event history; a worker crash mid-step doesn’t re-issue completed activities on restart. External signals: a dashboard that calls client.workflow.getHandle(workflowId).signal(abortSignal) sets aborted = true, and the loop exits at its next condition check. Resumability with a different initial state: when drift crosses the threshold, the orchestrator inspects the checkpoint, compacts the messages, and spawns a fresh workflow with the compacted prefix passed in as resumeFrom. This is the MOP-restart pattern with the saga’s compensation-and-resume shape underneath. Note: the executeTool activity itself must be idempotent. Temporal retries an activity that fails or times out (including one interrupted by a worker crash) up to the retry policy’s maximumAttempts, and the tool needs to absorb that repeated execution.

When to abort, not retry

The wrong instinct on a long-horizon failure is “retry from the failure point.” The failure is rarely the failure — it’s the symptom of accumulated drift, and retrying from the same drift-poisoned context produces the same shape of failure with new noise. The right instinct is more often abort, salvage, restart.

A checklist for the abort decision:

Drift score over threshold. Abort, save state, restart with a compacted context.
Goal drift detected. A second model (or a deterministic check) audits whether recent actions still serve the goal. If the audit fires, abort and re-plan from the original goal.
Compensable side effects accumulating. If you’ve shipped 312 PRs and the next 88 start failing in correlated ways, stop. Salvage the 312.
Non-compensable side effects imminent. Email send, payment capture, irrevocable account changes. If drift is climbing and the next predicted action is non-compensable, abort before the action. Better to lose 5 minutes of work than send the wrong email to 200 customers.
Wall-clock or budget breach. The agent-loop budgets still apply — they just fire later.
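The checklist can be read as a small decision function. The sketch below is one possible encoding, not a prescribed policy: the signal names, the ordering of checks, the failure limit of 3, and the rule that a non-compensable action is blocked once drift reaches half the threshold are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    CONTINUE = "continue"
    ABORT_AND_RESTART = "abort_and_restart"   # compact context, resume from checkpoint
    ABORT_AND_SALVAGE = "abort_and_salvage"   # stop, keep completed work, hand off

@dataclass
class RunSignals:
    drift_score: float
    drift_threshold: float
    goal_audit_failed: bool        # second model or deterministic check fired
    recent_failures: int           # correlated failures in the recent window
    next_action_compensable: bool
    budget_exceeded: bool

def decide(s: RunSignals, correlated_failure_limit: int = 3) -> Decision:
    if s.budget_exceeded:
        return Decision.ABORT_AND_SALVAGE
    # Non-compensable action with climbing drift: stop before acting.
    if not s.next_action_compensable and s.drift_score >= 0.5 * s.drift_threshold:
        return Decision.ABORT_AND_SALVAGE
    if s.recent_failures >= correlated_failure_limit:
        return Decision.ABORT_AND_SALVAGE
    if s.goal_audit_failed or s.drift_score >= s.drift_threshold:
        return Decision.ABORT_AND_RESTART
    return Decision.CONTINUE
```

Keeping the decision in a pure function like this means it can be unit-tested against recorded runs, independently of the model.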

The aborted-with-partial-results outcome is a first-class success state, not a failure. A migration agent that finished 312 of 400 procedures and stopped cleanly is a win; one that finished 400 of 400 with a broken test harness is a loss. The harness must expose partial state cleanly — completion list, abort reason, recovery hints — so the next run or the human operator can pick up without re-doing work.

Trade-offs, failure modes, gotchas

Drift detection is noisy. Every drift signal has a false-positive regime — flaky API calls, legitimate re-exploration of the same directory during a refactor. Tuning the threshold is task-specific. Log drift score over time, find the natural noise floor, and set the threshold well above noise and well below the meltdown line.
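A hedged sketch of that tuning step, assuming you have logged the peak drift score for a set of runs labeled healthy and a set labeled meltdown. `suggest_threshold` is a hypothetical helper, and the midpoint rule is just one reasonable choice.

```python
def suggest_threshold(healthy_peaks: list[float], meltdown_peaks: list[float]) -> float:
    """Midpoint between the healthy noise ceiling and the lowest meltdown peak."""
    if not healthy_peaks or not meltdown_peaks:
        raise ValueError("need at least one run of each kind")
    noise_ceiling = max(healthy_peaks)
    meltdown_floor = min(meltdown_peaks)
    if noise_ceiling >= meltdown_floor:
        # No gap to place a threshold in: the signal itself needs work.
        raise ValueError("signals overlap; this drift score cannot separate the runs")
    return (noise_ceiling + meltdown_floor) / 2

print(suggest_threshold([0.1, 0.2, 0.3], [0.9, 1.1]))
```

The overlap error is the useful part. If healthy runs peak higher than some meltdowns, no threshold will work and the drift signal needs redesigning before it needs tuning.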

Checkpoints lie about what was durable. A checkpoint records that the harness thinks a step happened; it doesn’t record whether the external system actually committed. The standard saga fix applies: every mutating tool’s success criterion must be read-after-write confirmation, not request-acknowledgement.
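Read-after-write confirmation can be sketched as a wrapper around any mutating tool. The `write` and `read` callables and the `expected` fields are placeholders for whatever the external system actually exposes.

```python
from typing import Callable, Optional

def write_confirmed(write: Callable[[], str],
                    read: Callable[[str], Optional[dict]],
                    expected: dict) -> dict:
    """Perform a write, then confirm it by reading the resource back."""
    resource_id = write()            # the acknowledgement alone is not trusted
    observed = read(resource_id)
    if observed is None:
        return {"ok": False, "id": resource_id, "error": "not found after write"}
    mismatched = {k: v for k, v in expected.items() if observed.get(k) != v}
    if mismatched:
        return {"ok": False, "id": resource_id,
                "error": f"mismatch: {sorted(mismatched)}"}
    return {"ok": True, "id": resource_id}
```

Only an `ok: True` result from a wrapper like this should be recorded as a completed step in the checkpoint. Against an eventually consistent store the read may need a short retry loop, which this sketch omits.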

Compacted-context restart isn’t free. The MOP-restart pattern works because the new context is clean — but the agent has to re-learn what it knew. The summary carried across restart must encode every load-bearing fact (completed IDs, environment state, decisions made). Drop the wrong one and you start a fresh hallucination cascade with the same shape. Context compression discipline applies directly.

Long-horizon evals don’t transfer from short-horizon evals. A model that scores 95% on a 30-minute task suite may score 40% on a 4-hour suite. Build a long-horizon eval — even 10-20 tasks at the target duration — and run it on every harness change.

Memory scaffolds can hurt long-horizon reliability. The “Beyond pass@1” framework reports that memory-augmented scaffolds never improved long-horizon reliability on the tasks they tested, and hurt 6 of 10 models — likely the same hallucination-cascade mechanism with a longer corruption window. Apply the memory write policy and reflection discipline: audit writes carefully.

Procedural memory is the highest-leverage long-horizon win. If the agent does similar long-horizon tasks repeatedly, the procedural-memory pattern is a 2-3× reliability improvement on top of everything in this article. Cache plan templates from successful runs, key them by task shape, and retrieve the matching template at the start of the next run. The harness then uses the template as a prior instead of having the model re-derive the whole task plan.

Multi-agent isn’t a solution to long-horizon reliability. The multi-agent piece trades tokens for parallelism, not for reliability. If each subagent’s task is itself long-horizon, you’ve only moved the problem. Multi-agent buys parallelism on naturally parallel work; long-horizon reliability engineering buys sequential work that doesn’t fit in one context. Both axes need to be engineered explicitly.

Observability is the difference between recoverable and unrecoverable. A 4-hour agent run with no traces is a 4-hour run you can’t debug. Log every tool call, drift signal, checkpoint, and usage row to a durable trace store (Langfuse, Arize Phoenix, LangSmith, or an OTel pipeline) — the production tracing piece is the dedicated walk-through of span shape and the platform decision. Without it, you can’t tell whether the failure was at step 47, 312, or distributed across 200 small mistakes — and the right intervention is different for each.

What to read next

Anatomy of an Agent Harness — the runtime layer that owns recoverability. Saga compensations, idempotency keys, durable-execution semantics, the abort-vs-retry decision — every primitive in today’s piece lives inside the harness, not the model.
Conversation Compaction: Keeping Long Sessions Alive — the orchestration of the in-place compaction primitive that makes MOP-restart possible. Reactive vs preemptive triggers, cache-aware deletion, and the snapshot-and-rollback discipline ported from today’s checkpointing primitives to a much finer grain.
The Agent Loop: ReAct and Its Descendants — the loop body today’s piece sits on top of. The budgets, no-progress detection, and idempotency primitives are the same primitives that fire later (and harder) in long-horizon work.
Agent Budgets and Runaway Prevention — the enforcement primitives that fire before the next step in a long-horizon run. The seven gates (step, deadline, tokens, dollars, per-tool quota, no-progress, abort) plus the discipline of persisting partial state on breach, which is what makes the saga-compensation surface in today’s piece actually recoverable.