Anatomy of an Agent Harness

Inside the agent harness: context assembly, tool dispatch, streaming, cache management, error recovery, cost accounting, telemetry — and build-vs-buy.

Jatin Bansal

A team ships a customer-support agent. The proof-of-concept was 80 lines of Python around the Anthropic SDK — a while loop, a messages.create, a dispatch on tool_use blocks. Six weeks in, the same code base is 4,000 lines: a tool-retry policy, a token counter that diverges from the dashboard by 8%, a half-finished compactor, a cancel button that doesn’t propagate upstream, a Postgres conversation log, a tracer that drops the parent span on errors, and a system prompt with three timestamps killing the cache hit rate. None of those pieces is hard. Owning them as a coherent runtime is hard, and that runtime — the thing wrapped around the model — is the harness. This article is its anatomy.

Opening bridge

The agent-loop article split the agent into model decisions and harness duties and promised a dedicated piece on the second. Subsequent articles leaned on the same vocabulary: multi-agent orchestration said “the protocol the harness defines,” computer use said “the harness translates element IDs to locators,” long-horizon reliability said “the harness owns recoverability.” That word has been a placeholder across six articles. Today we make it concrete: the seven duties of a production harness, what couples them into one runtime, and when off-the-shelf fits. This closes the Agents subtree before we return to memory.

Definition

A harness is the runtime layer that turns a stateless LLM into a stateful, observable, recoverable, budgeted agent. It owns seven duties: context assembly (building the next messages payload from system prompt, tool catalog, persisted history, retrieved memory, current user input); tool dispatch (routing tool_use blocks to executors with validation and timeouts); streaming (passing the SSE token stream through while accumulating internal state); prompt-cache management (breakpoints, churn suppression, hit-rate monitoring); error recovery (typed tool_result errors, backoff, circuit breakers, fatal-error escalation); cost accounting (input/output/cached tokens to dollars, with per-step and per-session caps); telemetry (span-level traces for every model and tool call, exported to a durable store). A harness that does six of these well and the seventh badly isn’t “mostly done” — it’s broken at the weakest seam, and the weakest duty determines how the whole runtime fails.

The kernel/userspace parallel

The harness is the kernel; the model is userspace. Phil Schmid’s 2026 piece draws the same shape (model as CPU, context as RAM, harness as OS, agent as application) and rests on the same load-bearing claim: the harness owns every privileged operation, and the model sees only an abstraction surface. The mappings are precise. System calls are tool calls: both layers exist to centralize the trust boundary. Process scheduling is step budgeting: the kernel decides when the OOM killer fires; the harness decides when no-progress detection cuts the run. Virtual memory is context assembly: the kernel lies to userspace about contiguous memory; the harness lies to the model about which facts have always been there. The trap table is the error translator: hardware faults become signals; tool exceptions become tool_result blocks with is_error: true. Auditing is telemetry.

This isn’t decorative. Don’t ask the model to enforce a token budget any more than you’d ask a process to enforce its own scheduling quantum. The model is the policy; the harness is the kernel that turns policy into a well-formed system call sequence. Martin Fowler’s harness-engineering piece makes the same point for coding agents: the harness is “a specific form of context engineering” whose feedforward and feedback controls are the kernel’s guides and sensors.

Duty 1: context assembly

Every turn, the harness builds the next API payload from disparate sources in a specific order: tool catalog (cached) → system prompt (cached) → persisted memory (per-session, cached) → retrieved JIT chunks (RAG output, below the breakpoint, not cached) → conversation history (cached up to the last completed turn) → current user message. The ordering follows prompt-cache mechanics: stable content first, churning content last, so the cacheable prefix grows monotonically. Inverting the persistent-memory and JIT slots silently kills the hit rate — the model sees the same prompt, the provider sees a different prefix. This is the single most expensive assembly bug, invisible without telemetry that reports cache_creation_input_tokens separately from cache_read_input_tokens.

Two sub-disciplines follow. Stitching from persistence replays prior tool_use/tool_result pairs verbatim: the API expects byte-identical blocks, and assistant turns are immutable records. Budgeting the window estimates the token count before the call and triggers compaction when needed. Compaction must be cache-aware, because a summarizer that rewrites the prefix invalidates every downstream cache.
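The cost of a mis-ordered payload can be made concrete with a small sketch. It assumes the provider caches by exact prefix match, represents each payload as an ordered list of hypothetical (slot, text) segments, and counts how many leading segments two consecutive turns share. The slot names and the shared_prefix_len and assemble helpers are illustrative, not part of any SDK.

```python
from typing import List, Tuple

Segment = Tuple[str, str]  # (slot name, rendered text); both hypothetical


def shared_prefix_len(prev: List[Segment], curr: List[Segment]) -> int:
    """Count leading segments that are identical across two payloads."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


def assemble(memory: str, jit: str, memory_first: bool) -> List[Segment]:
    """Build a payload with stable slots first; optionally invert two slots."""
    head = [("tools", "search, refund"), ("system", "You are a support agent.")]
    mem, chunks = ("memory", memory), ("jit", jit)
    return head + ([mem, chunks] if memory_first else [chunks, mem])


# Same session memory on both turns; the retrieved chunks differ per turn.
good_t1 = assemble("prefers email", "doc A", memory_first=True)
good_t2 = assemble("prefers email", "doc B", memory_first=True)
bad_t1 = assemble("prefers email", "doc A", memory_first=False)
bad_t2 = assemble("prefers email", "doc B", memory_first=False)

print(shared_prefix_len(good_t1, good_t2))  # memory stays inside the shared prefix
print(shared_prefix_len(bad_t1, bad_t2))    # shared prefix ends before memory
```

Both orderings hand the model the same four pieces of content, but only the first keeps the session memory inside the prefix that two turns have in common.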

Duty 2: tool dispatch

For every tool_use block, the harness must do six things. Validate input against the schema (redundant under strict-mode constraints). Authorize against per-tool ACLs and sandbox boundaries; the OpenAI Agents SDK’s April 2026 manifest abstraction and Claude Code’s permission gate are this layer, and the guardrails article covers how the classifier stack complements this harness-side authorization. Execute under a wall-clock deadline. Attach an idempotency key to mutating tools (the model is an at-least-once caller, per the tool-use article). Serialize results and errors as tool_result blocks, so exceptions never leak past dispatch. Parallelize independent calls, with care: both providers default to concurrent execution, which is the wrong default for state-touching tools.

Dispatch gets subtle across a large tool catalog. Past ~30 tools the harness injects only a retrieved subset per turn — embedding-based selection, namespacing, lazy schema loading — which happens inside context assembly but shares a consistency budget with dispatch: too narrow and the model can’t reach the tool it needs; too wide and cache discipline collapses under tool-list churn.
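The "execute under a wall-clock deadline" step can be sketched with the standard library alone. This is a minimal illustration, assuming tools are plain synchronous callables; call_with_deadline and the two sample tools are hypothetical names. It bounds how long the harness waits, not how long the tool runs, which is the main limitation of a thread-based approach.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn: Callable[[dict], dict], args: dict, timeout_s: float) -> dict:
    """Run fn(args) in a worker thread and return a tool_result-style dict."""
    future = _pool.submit(fn, args)
    try:
        return {"is_error": False, "content": future.result(timeout=timeout_s)}
    except FutureTimeout:
        # The worker keeps running: Python cannot kill a thread. A real executor
        # needs cooperative cancellation or a subprocess boundary.
        future.cancel()
        return {"is_error": True, "content": f"timeout after {timeout_s}s"}
    except Exception as e:
        # Exceptions never leak past dispatch; they become error results.
        return {"is_error": True, "content": f"{type(e).__name__}: {e}"}


def fast(args: dict) -> dict:
    return {"echo": args["x"]}


def slow(_args: dict) -> dict:
    time.sleep(0.3)
    return {"ok": True}


fast_result = call_with_deadline(fast, {"x": 1}, timeout_s=1.0)
slow_result = call_with_deadline(slow, {}, timeout_s=0.05)
print(fast_result, slow_result)
```

Because the timed-out work may still complete in the background, this pattern is only safe for mutating tools when combined with the idempotency key described above.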

Duty 3: streaming

The harness sits between two streams: the inbound SSE event stream from the provider (content_block_delta, input_json_delta, etc.) and the outbound chunks the caller sees. The non-trivial part is that the harness also consumes the inbound stream for its own bookkeeping — as input_json_delta events accumulate for a tool_use block, it incrementally parses the partial JSON and prepares dispatch (lookup, credentials, rate-limit reservation) so the executor fires the moment content_block_stop arrives. A naive harness waits for the full assistant turn and adds 100–500ms of serial latency per tool call.

Cancellation is the other hot edge. “Stop” must (a) close the upstream connection so the provider stops decoding (and billing), (b) abort in-flight tool executions whose results will never be used, (c) flush partial state to the conversation log. Forgetting any of the three is how cancel buttons end up purely cosmetic.
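The bookkeeping side of streaming can be sketched against simulated events. The dicts below mirror the shape of the provider's content_block_start, content_block_delta (input_json_delta), and content_block_stop events, but they are hand-written stand-ins, and consume is a hypothetical helper. As a simplification, it buffers the fragments and parses once at content_block_stop rather than parsing incrementally; the tool lookup still happens at block start, before the arguments finish streaming.

```python
import json
from typing import Callable, Dict, Iterable, List


def consume(events: Iterable[dict], tools: Dict[str, Callable[[dict], dict]]) -> List[dict]:
    """Accumulate tool-input fragments per block; dispatch at content_block_stop."""
    pending: Dict[int, dict] = {}  # block index -> {"id", "name", "fn", "buf"}
    results: List[dict] = []
    for ev in events:
        kind = ev["type"]
        if kind == "content_block_start" and ev["content_block"]["type"] == "tool_use":
            block = ev["content_block"]
            # Preparation (here just the lookup) runs while arguments still stream.
            pending[ev["index"]] = {"id": block["id"], "name": block["name"],
                                    "fn": tools.get(block["name"]), "buf": []}
        elif kind == "content_block_delta" and ev["delta"]["type"] == "input_json_delta":
            if ev["index"] in pending:
                pending[ev["index"]]["buf"].append(ev["delta"]["partial_json"])
        elif kind == "content_block_stop" and ev["index"] in pending:
            p = pending.pop(ev["index"])
            args = json.loads("".join(p["buf"]) or "{}")
            if p["fn"] is None:
                results.append({"tool_use_id": p["id"], "is_error": True,
                                "content": f"unknown tool: {p['name']}"})
            else:
                results.append({"tool_use_id": p["id"], "content": p["fn"](args)})
    return results


events = [
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "tool_use", "id": "toolu_1", "name": "lookup"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "input_json_delta", "partial_json": '{"order_'}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "input_json_delta", "partial_json": 'id": "A1"}'}},
    {"type": "content_block_stop", "index": 0},
]
out = consume(events, {"lookup": lambda a: {"status": "shipped", **a}})
print(out)
```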

Duty 4: prompt-cache management

The harness owns the cache hit rate, through three practices. Breakpoint placement: on Anthropic, mark up to 4 cache_control breakpoints (canonical: last tool, system prompt, lookback window, current message); on OpenAI, set prompt_cache_key to a stable shard hint (per-tenant or per-application-version). Churn suppression: no timestamps in the prefix, no per-request IDs, no tool-description edits between deploys. A linter that fails CI on datetime.now() in the system prompt is a legitimate harness component. Monitoring: aggregate cache_creation_input_tokens and cache_read_input_tokens into per-session and per-tenant hit-rate metrics, alert on degradation, and tag traces with cache state. A harness without cache telemetry can run at 5–10× the cost it should, and the gap stays invisible until the invoice arrives.
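Two of these practices reduce to a few lines each. The sketch below is illustrative only: the churn patterns are an assumed starting list, lint_prefix checks rendered prompt text rather than source code, and cache_hit_rate uses one possible definition of hit rate, assuming the reported input token count excludes cached tokens.

```python
import re

# Patterns that usually indicate per-request churn in a cached prefix (assumed list).
CHURN_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}"),
    "uuid": re.compile(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I),
}


def lint_prefix(text: str) -> list:
    """Return the names of churn patterns found in a rendered prefix."""
    return [name for name, pat in CHURN_PATTERNS.items() if pat.search(text)]


def cache_hit_rate(input_tokens: int, cache_read: int, cache_write: int) -> float:
    """Share of prompt tokens served from cache (one possible definition)."""
    total = input_tokens + cache_read + cache_write
    return cache_read / total if total else 0.0


print(lint_prefix("You are a support agent. Current time: 2026-05-12T09:30:00Z"))
print(lint_prefix("You are a support agent."))
print(cache_hit_rate(input_tokens=100, cache_read=800, cache_write=100))
```

A CI job could fail whenever lint_prefix returns a non-empty list for the rendered system prompt, and an alert could fire when the per-tenant hit rate drops below a threshold chosen from observed traffic.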

Duty 5: error recovery

Three flavors, three paths. Transient errors (network blip, 503, rate-limit) are retried in dispatch with capped exponential backoff — the model never sees them, like a kernel retrying a transient disk read before surfacing EIO. Model-recoverable failures (invalid argument, missing resource, business-rule violation) become tool_result blocks with is_error: true and a precise message — the model is very good at recovering from “tool X failed: ‘price_id’ not found, did you mean ‘product_id’?” and terrible at recovering from raised exceptions. Fatal errors (provider 500s, exhausted budget, meltdown precursors) terminate cleanly, persist partial state, return structured failure with the saga compensation surface from the long-horizon reliability article. Every error has exactly one handling path, decided by the harness.
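The three paths can be sketched as a single routing function. The exception classes and run_tool are hypothetical names, and the backoff parameters are placeholders; the point is only that each failure class has exactly one exit.

```python
import random
import time
from typing import Callable


class TransientError(Exception): ...
class ToolInputError(Exception): ...
class FatalError(Exception): ...


def run_tool(fn: Callable[[], dict], max_attempts: int = 4,
             base_s: float = 0.1, cap_s: float = 2.0,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Route each failure class to exactly one handling path."""
    for attempt in range(max_attempts):
        try:
            return {"is_error": False, "content": fn()}
        except TransientError:
            # Path 1: retried here; the model never sees these.
            if attempt == max_attempts - 1:
                raise FatalError("transient retries exhausted")
            # Capped exponential backoff with full jitter.
            sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))
        except ToolInputError as e:
            # Path 2: surfaced to the model as an error result it can act on.
            return {"is_error": True, "content": str(e)}
        # Path 3: anything else (including FatalError) propagates to the caller.
    raise FatalError("max_attempts must be at least 1")


calls = {"n": 0}


def flaky() -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503")
    return {"ok": True}


def bad_input() -> dict:
    raise ToolInputError("'price_id' not found, did you mean 'product_id'?")


print(run_tool(flaky, sleep=lambda s: None))
print(run_tool(bad_input))
```

Passing sleep as a parameter keeps the retry policy testable without real delays.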

Duty 6: cost accounting

Two counters per session, tokens (input + output + cached-read + cached-write) and dollars (priced from the rate card with separate cache tiers), broken down by call, tool, and session. The harness aggregates online and enforces the budget before the call, per the agent-loop article. The common trap: counting tokens off usage without separating cache hits from misses. At a $15-per-million input rate, the dashboard reads “20M tokens” but the bill is closer to $30 than $300, because most were cache reads at 10% of base price; or the inverse, where a deploy introduces a cache-busting timestamp and the bill is silently 10×. Cost accounting that doesn’t break down by cache state is useless for diagnosis.
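A short worked calculation shows why the breakdown matters. It reuses the per-million rates from the PRICES table in this article's Python harness; the token volumes are made-up round numbers, and dollars is a hypothetical helper.

```python
# Per-million-token rates, copied from the harness example's PRICES table.
PRICES = {"input": 15.0, "output": 75.0, "cache_read": 1.5, "cache_write": 18.75}


def dollars(input_tokens: int = 0, output_tokens: int = 0,
            cache_read: int = 0, cache_write: int = 0) -> float:
    """Price a usage record with separate tiers for each cache state."""
    return (input_tokens * PRICES["input"]
            + output_tokens * PRICES["output"]
            + cache_read * PRICES["cache_read"]
            + cache_write * PRICES["cache_write"]) / 1e6


# Same prompt volume, two cache states.
warm = dollars(cache_read=20_000_000)    # every prompt token is a cache read
cold = dollars(input_tokens=20_000_000)  # same volume after a cache-busting change

print(warm, cold, cold / warm)
```

A counter that only records "20M prompt tokens" cannot distinguish these two sessions, even though one costs ten times the other.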

Duty 7: telemetry

Per turn: a parent span for the full turn (tokens, latency, cache state, cost); child spans for each model call, tool call, retry, and compaction trigger; attributes for model name, tool names, signal flags (error, cache_miss, budget_breach), session/tenant IDs; payloads behind a sampling flag with PII scrubbed. Langfuse, Arize Phoenix, LangSmith, and OpenTelemetry collectors with the GenAI semantic conventions all accept this shape — the production-tracing piece is the dedicated walk-through of span shape, OTel vs OpenInference, sampling, and the build-vs-buy decision across the platforms. Telemetry isn’t bolted on at the end — it decides whether the previous six duties are debuggable. A harness with bad traces is untestable in production; a harness with good traces lets one engineer reconstruct a five-step failure path in two minutes.

Code: a minimum-viable harness in Python

The whole anatomy fits in one ~130-line example. It is pedagogical, not production-ready, but every duty appears somewhere (streaming only as a comment). It uses the Anthropic SDK. Install: pip install anthropic.

python

import json, time, uuid, hashlib, logging
from dataclasses import dataclass, field
from typing import Callable
from anthropic import Anthropic

log = logging.getLogger("harness")
client = Anthropic()

@dataclass
class ToolSpec:
    name: str
    schema: dict
    fn: Callable[[dict], dict]
    mutating: bool = False
    timeout_s: float = 10.0

@dataclass
class Budget:
    max_steps: int = 20
    max_seconds: float = 60.0
    max_dollars: float = 1.00

@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read: int = 0
    cache_write: int = 0
    dollars: float = 0.0

# Price table per million tokens (Claude Opus 4.7 as of 2026-05).
PRICES = {"input": 15.0, "output": 75.0, "cache_read": 1.5, "cache_write": 18.75}

class Harness:
    def __init__(self, model: str, tools: list[ToolSpec], system: str, budget: Budget):
        self.model = model
        self.tools = {t.name: t for t in tools}
        # Duty 1 (assembly): cache breakpoint on the system block, stable prefix.
        self.system = [{"type": "text", "text": system,
                        "cache_control": {"type": "ephemeral"}}]
        self.tool_schemas = [{"name": t.name, "input_schema": t.schema,
                              "description": t.schema.get("description", "")}
                             for t in tools]
        self.budget = budget
        self.usage = Usage()
        # Duty 5 (recovery): per-(tool, args) idempotency receipts.
        self.idem: dict[str, dict] = {}

    def _account(self, u) -> None:
        # Duty 6 (cost): full breakdown by cache state.
        self.usage.input_tokens += u.input_tokens
        self.usage.output_tokens += u.output_tokens
        self.usage.cache_read += getattr(u, "cache_read_input_tokens", 0) or 0
        self.usage.cache_write += getattr(u, "cache_creation_input_tokens", 0) or 0
        self.usage.dollars += (
            u.input_tokens * PRICES["input"] / 1e6
            + u.output_tokens * PRICES["output"] / 1e6
            + (getattr(u, "cache_read_input_tokens", 0) or 0) * PRICES["cache_read"] / 1e6
            + (getattr(u, "cache_creation_input_tokens", 0) or 0) * PRICES["cache_write"] / 1e6
        )

    def _trace(self, span: str, **attrs) -> None:
        # Duty 7 (telemetry): structured spans. Real impl ships to OTel/Langfuse.
        log.info(json.dumps({"span": span, "ts": time.time(), **attrs}))

    def _dispatch(self, block) -> dict:
        # Duty 2 (dispatch): look up the tool, idempotency-key mutating calls,
        # execute, and serialize errors. Authorization and the deadline wrapper
        # are omitted here; mutating tools get a receipt cache that absorbs retries.
        tool = self.tools.get(block.name)
        if tool is None:
            return {"type": "tool_result", "tool_use_id": block.id, "is_error": True,
                    "content": f"unknown tool: {block.name}"}
        if tool.mutating:
            key = hashlib.sha1(f"{block.name}:{json.dumps(block.input, sort_keys=True)}"
                               .encode()).hexdigest()
            if key in self.idem:
                self._trace("dispatch.idem_hit", tool=block.name, key=key)
                return {"type": "tool_result", "tool_use_id": block.id,
                        "content": json.dumps(self.idem[key])}
        try:
            t0 = time.monotonic()
            result = tool.fn(block.input)   # real impl wraps with a deadline
            dt = time.monotonic() - t0
            self._trace("dispatch.ok", tool=block.name, ms=int(dt * 1000))
            if tool.mutating:
                self.idem[key] = result
            return {"type": "tool_result", "tool_use_id": block.id,
                    "content": json.dumps(result)}
        except Exception as e:
            self._trace("dispatch.error", tool=block.name, error=str(e))
            return {"type": "tool_result", "tool_use_id": block.id, "is_error": True,
                    "content": f"{type(e).__name__}: {e}"}

    def run(self, user_msg: str) -> str:
        run_id = str(uuid.uuid4())
        messages = [{"role": "user", "content": user_msg}]
        started = time.monotonic()
        self._trace("run.start", run_id=run_id)

        for step in range(self.budget.max_steps):
            # Budget gate BEFORE the call — duty 6 + duty 5.
            if time.monotonic() - started > self.budget.max_seconds:
                self._trace("run.abort", reason="deadline", step=step)
                return "[aborted: deadline]"
            if self.usage.dollars > self.budget.max_dollars:
                self._trace("run.abort", reason="budget", step=step,
                            dollars=self.usage.dollars)
                return "[aborted: budget]"

            # Duty 3 (streaming) would wrap the call here. For brevity, non-streamed.
            resp = client.messages.create(
                model=self.model, max_tokens=2048,
                system=self.system, tools=self.tool_schemas, messages=messages,
            )
            self._account(resp.usage)
            self._trace("model.call", step=step,
                        in_tokens=resp.usage.input_tokens,
                        out_tokens=resp.usage.output_tokens,
                        cache_read=getattr(resp.usage, "cache_read_input_tokens", 0),
                        cache_write=getattr(resp.usage, "cache_creation_input_tokens", 0),
                        stop_reason=resp.stop_reason)
            messages.append({"role": "assistant", "content": resp.content})

            if resp.stop_reason != "tool_use":
                self._trace("run.complete", step=step, dollars=self.usage.dollars)
                return "".join(b.text for b in resp.content if b.type == "text")

            results = [self._dispatch(b) for b in resp.content if b.type == "tool_use"]
            messages.append({"role": "user", "content": results})

        self._trace("run.abort", reason="step_cap", steps=self.budget.max_steps)
        return "[aborted: step cap]"

The interesting thing isn’t any individual line — it’s the coupling. The accounting in _account writes into the same Usage object the budget gate reads from. The trace in _dispatch feeds the same pipeline as the trace in run. The idempotency cache makes model.call safely retryable from a higher layer. None of the duties is independent; they share state in ways the model can’t see. That coupling is the harness. Splitting the duties across libraries that don’t share state is how home-grown harnesses end up with the symptoms in the opening paragraph.

Code: the same shape via the OpenAI Agents SDK in TypeScript

For contrast, the framework version. The OpenAI Agents SDK shipped a model-native harness with sandbox execution in April 2026; the TypeScript SDK is the production-friendly path for Node/Edge surfaces. Install: npm install @openai/agents zod.

typescript

import { Agent, Runner, tool } from "@openai/agents";
import { z } from "zod";

const refundOrder = tool({
  name: "refund_order",
  description: "Refund an order. MUTATING. Idempotent on order_id.",
  parameters: z.object({ order_id: z.string(), reason: z.string() }),
  execute: async ({ order_id, reason }) => {
    // Idempotency lives in the refund service, not the SDK.
    return { refunded: true, order_id, dollars: 19.99 };
  },
});

const agent = new Agent({
  name: "support-agent",
  instructions: "Refund only with explicit user consent.",
  model: "gpt-5.5",
  tools: [refundOrder],
});

const runner = new Runner();   // SDK emits OTel traces.

export async function handleTurn(userMessage: string) {
  // maxTurns (the step cap) is passed as a per-run option.
  const result = await runner.run(agent, userMessage, { maxTurns: 20 });
  return { text: result.finalOutput, usage: result.usage };
}

The SDK collapses the loop, the tool_result plumbing, parallel dispatch, and trace emission. What you still own: the calibration of maxTurns, the wall-clock deadline (wrap runner.run in Promise.race with setTimeout), the dollar ceiling (SDK reports usage, doesn’t enforce a cap), idempotency on mutating tools. The framework moves the seam; it doesn’t remove it.

The 2026 framework landscape

Four buckets, each with a different theory of where the harness boundary sits.

Model-vendor harnesses. Claude Agent SDK and the OpenAI Agents SDK power the vendors’ own products (Claude Code, Codex). They are opinionated, fast-moving, and tightly coupled to one API surface; strict mode, prompt caching, and parallel tool calls “just work” because SDK and API ship together. Portability is the trade.

Graph harnesses. LangGraph (1.0 GA in 2025, 1.2 shipped May 11, 2026) and Google’s ADK model the agent as a state graph with checkpointed transitions. They are strong on durable execution and weaker on cache hygiene, because the graph engine is one abstraction away from cache_control markers.

TypeScript-native. Mastra (@mastra/core, release of April 30, 2026) is batteries-included, with memory, RAG, tools, evals, and observability in one package for Node/Edge runtimes.

Durable-execution. Temporal, Restate, and Inngest are not agent frameworks but workflow engines you build the loop on top of, as the long-horizon reliability article detailed.

Common pairings: LangGraph + Claude Agent SDK, Mastra + Inngest, LangGraph wrapping Temporal.

Build vs buy

The reflex is to pick a framework; the right decision is calibration-driven. Off-the-shelf harnesses come pre-calibrated for a particular workload shape — chat with light tool use, coding agents, research workflows. When your workload’s calibration assumptions match, you ship months faster. When they diverge, the framework’s defaults become a tax you pay on every turn, and its abstractions hide exactly the seams you need to optimize. Concrete assumptions to check: step cap defaults (Vercel AI SDK’s stepCountIs(20), OpenAI Agents SDK’s maxTurns, LangGraph’s recursion limit) against your actual task distribution; default tool-error handling (silent retry vs surface); cache breakpoint control (cache_control exposed or managed opaquely); state checkpointing semantics (between-node vs durable); telemetry shape (does the schema match your stack); provider coupling (one API or many, at what feature-parity cost).

If three or more of those assumptions are wrong for your workload, the framework will fight you on every iteration. If none are wrong, buy without guilt — Anthropic’s own building effective agents recommends starting with direct API calls precisely because most teams don’t need the framework’s abstractions yet. The build case is not “frameworks are bad” — it’s “the framework’s default calibration must approximately match your workload’s, or you’re paying for assumptions you don’t get to benefit from.” A reasonable progression: start with a 200-line custom harness on the vendor SDK; when the runtime hits a complexity wall — usually durable execution, multi-tenant cache discipline, or multi-agent coordination — adopt the framework whose calibration matches yours most cleanly. The pre-framework prototype gives you the vocabulary to read the docs and the empirical numbers to evaluate the defaults.

Trade-offs, failure modes, gotchas

The model thinks it’s running the show; the harness is. The most common architecture mistake is asking the model to enforce something the harness should enforce — “respect the budget”, “don’t loop”, “be careful with the database.” The model has no mechanism to enforce any of those across calls; it sees one turn at a time. The prompt should not contain the word “budget”.

Hidden state coupling between duties. The seven duties share state through the same in-memory objects. A “clean architecture” that splits accounting into one module, telemetry into another, dispatch into a third, with no shared state, will silently emit traces that don’t match the budget that doesn’t match the invoice. Either share state explicitly or stop claiming to have a coherent runtime.

Cache invalidation hidden in a refactor. Renaming a tool, reordering tools in the array, adding a comma to the system prompt — every textual change to the prefix invalidates the cache. The harness should assert the tokenized prefix hash in CI, with a deliberate update step when intentional.
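A CI check along these lines might look like the sketch below. It is a simplification: it hashes the serialized prefix text rather than its tokenization, and prefix_fingerprint, TOOLS, and SYSTEM are hypothetical names. In a real pipeline the pinned value would be a committed constant, updated deliberately.

```python
import hashlib
import json


def prefix_fingerprint(tools: list, system: str) -> str:
    """Hash the cacheable prefix exactly as ordered; any reordering changes the hash."""
    payload = json.dumps({"tools": tools, "system": system},
                         ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


TOOLS = [{"name": "lookup_order", "description": "Find an order."},
         {"name": "refund_order", "description": "Refund an order."}]
SYSTEM = "You are a support agent."

pinned = prefix_fingerprint(TOOLS, SYSTEM)  # in CI this would be a committed value

assert prefix_fingerprint(TOOLS, SYSTEM) == pinned                  # unchanged prefix
assert prefix_fingerprint(list(reversed(TOOLS)), SYSTEM) != pinned  # reordered tools
assert prefix_fingerprint(TOOLS, SYSTEM + ",") != pinned            # one added comma
```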

Provider differences that leak past the harness. Anthropic’s tool_choice is {type: "auto" | "any" | "tool" | "none"}; OpenAI’s is "auto" | "required" | "none". Anthropic’s stop reason is "tool_use"; OpenAI’s is "tool_calls". A harness that abstracts over both ends up lossy (smallest-common surface) or leaky (the abstraction breaks on advanced features). Pick one provider as canonical and translate at the boundary.
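Translating at the boundary can be as small as two functions. The sketch below treats the Anthropic shapes as canonical and maps to and from the OpenAI Chat Completions forms; the function names are hypothetical, and the mapping covers only the values listed here.

```python
def to_openai_tool_choice(choice: dict):
    """Translate an Anthropic-style tool_choice into the Chat Completions form."""
    kind = choice["type"]
    if kind == "auto":
        return "auto"
    if kind == "any":
        return "required"
    if kind == "none":
        return "none"
    if kind == "tool":
        return {"type": "function", "function": {"name": choice["name"]}}
    raise ValueError(f"unsupported tool_choice: {kind}")


# OpenAI finish_reason -> Anthropic-style stop_reason (canonical in this sketch).
STOP_REASONS = {"tool_calls": "tool_use", "stop": "end_turn", "length": "max_tokens"}


def to_canonical_stop(finish_reason: str) -> str:
    """Fail loudly on unmapped values instead of guessing."""
    try:
        return STOP_REASONS[finish_reason]
    except KeyError:
        raise ValueError(f"unmapped finish_reason: {finish_reason}") from None


print(to_openai_tool_choice({"type": "any"}))
print(to_openai_tool_choice({"type": "tool", "name": "refund_order"}))
print(to_canonical_stop("tool_calls"))
```

Raising on unmapped values is deliberate: a silent default is how a provider-specific feature turns into a leaky abstraction.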

Multi-tenancy collapses caches. A harness serving 10,000 users with per-user prompts is writing 10,000 distinct caches and reading from none of them. Use prompt_cache_key (OpenAI) or a stable prefix structure (Anthropic) to share the cacheable prefix across tenants; put tenant-specific content after the breakpoint. Forgetting this is how a “we turned on caching” report and a flat cost graph coexist.

Sandbox boundaries belong to the harness, not the tool. The OpenAI Agents SDK’s April 2026 sandbox and Claude Code’s permission gate live in the harness because the trust boundary is the harness’s job. A tool that sandboxes itself ships its own kernel — duplication and security risk in one move.

Eval the harness, not just the model. A model upgrade is a small change relative to a harness refactor; the second one is what breaks production. Run an eval suite end-to-end against frozen tasks and gate harness changes behind it the same way you gate prompt changes.

What to read next

Production Tracing and Observability for LLM Systems — duty 7 deep dive. Span shape for an LLM trace, the OpenTelemetry GenAI semantic conventions vs OpenInference, sampling and PII policies, and the build-vs-buy decision across LangSmith, Langfuse, Phoenix, Datadog, and Honeycomb. The telemetry surface that makes the other six duties debuggable.
Conversation Compaction: Keeping Long Sessions Alive — the deep dive on duty 1’s hardest sub-problem and duty 5’s worst failure mode. Reactive vs preemptive triggers, cache-aware surgical deletion, circuit breakers, snapshot-and-rollback, and append-only memory journals as the architectural alternative.
Prompt Caching: Reusing the KV Cache Across Calls — duty 4 deep dive. The cache discipline the harness owns and the lever with the biggest economic effect on a production agent.
Agent Budgets and Runaway Prevention — duty 6 deep dive. The cost-accounting and enforcement story this article sketched: the seven primitives (step cap, deadline, token ceiling, dollar cap, per-tool quota, no-progress, external abort), the OS and distributed-systems heritage, and the alerts-are-not-enforcement framing that turns observability into the audit log for the budget gate, not a substitute for it.