Anatomy of an Agent Harness
Inside the agent harness: context assembly, tool dispatch, streaming, cache management, error recovery, cost accounting, telemetry — and build-vs-buy.
A team ships a customer-support agent. The proof-of-concept was 80 lines of Python around the Anthropic SDK — a while loop, a messages.create, a dispatch on tool_use blocks. Six weeks in, the same code base is 4,000 lines: a tool-retry policy, a token counter that diverges from the dashboard by 8%, a half-finished compactor, a cancel button that doesn’t propagate upstream, a Postgres conversation log, a tracer that drops the parent span on errors, and a system prompt with three timestamps killing the cache hit rate. None of those pieces is hard. Owning them as a coherent runtime is hard, and that runtime — the thing wrapped around the model — is the harness. This article is its anatomy.
Opening bridge
The agent-loop article split the agent into model decisions and harness duties and promised a dedicated piece on the second. Subsequent articles leaned on the same vocabulary: multi-agent orchestration said “the protocol the harness defines,” computer use said “the harness translates element IDs to locators,” long-horizon reliability said “the harness owns recoverability.” That word has been a placeholder across six articles. Today we make it concrete: the seven duties of a production harness, what couples them into one runtime, and when off-the-shelf fits. This closes the Agents subtree before we return to memory.
Definition
A harness is the runtime layer that turns a stateless LLM into a stateful, observable, recoverable, budgeted agent. It owns seven duties: context assembly (building the next messages payload from system prompt, tool catalog, persisted history, retrieved memory, current user input); tool dispatch (routing tool_use blocks to executors with validation and timeouts); streaming (passing the SSE token stream through while accumulating internal state); prompt-cache management (breakpoints, churn suppression, hit-rate monitoring); error recovery (typed tool_result errors, backoff, circuit breakers, fatal-error escalation); cost accounting (input/output/cached tokens to dollars, with per-step and per-session caps); telemetry (span-level traces for every model and tool call, exported to a durable store). A harness that does six of these well and the seventh badly isn’t “mostly done” — it’s broken at the weakest seam, and the weakest duty determines how the whole runtime fails.
The kernel/userspace parallel
The harness is the kernel; the model is userspace. Phil Schmid’s 2026 piece draws the same shape with model-as-CPU, context-as-RAM, harness-as-OS, agent-as-application — load-bearing claim identical: the harness owns every privileged operation, the model sees only an abstraction surface. The mappings are precise. System calls are tool calls — both layers exist to centralize the trust boundary. Process scheduling is step budgeting — the kernel decides when the OOM killer fires; the harness decides when no-progress detection cuts the run. Virtual memory is context assembly — the kernel lied to userspace about contiguous memory; the harness lies to the model about which facts have always been there. The trap table is the error translator — hardware faults become signals; tool exceptions become tool_result blocks with is_error: true. Auditing is telemetry.
This isn’t decorative. Don’t ask the model to enforce a token budget any more than you’d ask a process to enforce its own scheduling quantum. The model is the policy; the harness is the kernel that turns policy into a well-formed system call sequence. Martin Fowler’s harness-engineering piece makes the same point for coding agents: the harness is “a specific form of context engineering” whose feedforward and feedback controls are the kernel’s guides and sensors.
Duty 1: context assembly
Every turn, the harness builds the next API payload from disparate sources in a specific order: tool catalog (cached) → system prompt (cached) → persisted memory (per-session, cached) → retrieved JIT chunks (RAG output, below the breakpoint, not cached) → conversation history (cached up to the last completed turn) → current user message. The ordering follows prompt-cache mechanics: stable content first, churning content last, so the cacheable prefix grows monotonically. Inverting the persistent-memory and JIT slots silently kills the hit rate — the model sees the same prompt, the provider sees a different prefix. This is the single most expensive assembly bug, invisible without telemetry that reports cache_creation_input_tokens separately from cache_read_input_tokens.
Two sub-disciplines: stitching from persistence replays prior tool_use/tool_result pairs verbatim (the API expects byte-identical blocks; assistant turns are immutable records), and budgeting the window estimates token count before the call and triggers compaction when needed — compaction must be cache-aware, because a summarizer that rewrites the prefix invalidates every downstream cache.
Duty 2: tool dispatch
For every tool_use block, the harness must validate input against the schema (redundant under strict-mode constraints), authorize against per-tool ACLs and sandbox boundaries (the OpenAI Agents SDK’s April 2026 manifest abstraction and Claude Code’s permission gate are this layer), execute under a wall-clock deadline, idempotency-key mutating tools (the model is an at-least-once caller, per the tool-use article), serialize results and errors as tool_result blocks (exceptions never leak past dispatch), and parallelize independent calls — both providers default to concurrent execution, which is the wrong default for state-touching tools.
Dispatch gets subtle across a large tool catalog. Past ~30 tools the harness injects only a retrieved subset per turn — embedding-based selection, namespacing, lazy schema loading — which happens inside context assembly but shares a consistency budget with dispatch: too narrow and the model can’t reach the tool it needs; too wide and cache discipline collapses under tool-list churn.
Duty 3: streaming
The harness sits between two streams: the inbound SSE event stream from the provider (content_block_delta, input_json_delta, etc.) and the outbound chunks the caller sees. The non-trivial part is that the harness also consumes the inbound stream for its own bookkeeping — as input_json_delta events accumulate for a tool_use block, it incrementally parses the partial JSON and prepares dispatch (lookup, credentials, rate-limit reservation) so the executor fires the moment content_block_stop arrives. A naive harness waits for the full assistant turn and adds 100–500ms of serial latency per tool call.
Cancellation is the other hot edge. “Stop” must (a) close the upstream connection so the provider stops decoding (and billing), (b) abort in-flight tool executions whose results will never be used, (c) flush partial state to the conversation log. Forgetting any of the three is how cancel buttons end up purely cosmetic.
Duty 4: prompt-cache management
The harness owns the cache hit rate. Breakpoint placement: on Anthropic, mark up to 4 cache_control breakpoints (canonical: last tool, system prompt, lookback window, current message); on OpenAI, set prompt_cache_key to a stable shard hint (per-tenant or per-application-version). Churn suppression: no timestamps in the prefix, no per-request IDs, no tool-description edits between deploys — a linter that fails CI on datetime.now() in the system prompt is a legitimate harness component. Monitoring: aggregate cache_creation_input_tokens and cache_read_input_tokens into per-session and per-tenant hit-rate metrics, alert on degradation, tag traces with cache state. A harness without cache telemetry runs at 5-10× the cost it should — invisible until the invoice arrives.
Duty 5: error recovery
Three flavors, three paths. Transient errors (network blip, 503, rate-limit) are retried in dispatch with capped exponential backoff — the model never sees them, like a kernel retrying a transient disk read before surfacing EIO. Model-recoverable failures (invalid argument, missing resource, business-rule violation) become tool_result blocks with is_error: true and a precise message — the model is very good at recovering from “tool X failed: ‘price_id’ not found, did you mean ‘product_id’?” and terrible at recovering from raised exceptions. Fatal errors (provider 500s, exhausted budget, meltdown precursors) terminate cleanly, persist partial state, return structured failure with the saga compensation surface from the long-horizon reliability article. Every error has exactly one handling path, decided by the harness.
Duty 6: cost accounting
Two counters per session — tokens (input + output + cached-read + cached-write) and dollars (priced from the rate card with separate cache tiers) — broken down by call, tool, and session. The harness aggregates online and enforces the budget before the call, per the agent-loop article. The common trap: counting tokens off usage without separating cache hits from misses. The dashboard reads “1.2M tokens” but the bill is $30, not $300, because most were cache reads at 10% of base price; or the inverse, where a deploy introduces a cache-busting timestamp and the bill is silently 10×. Cost accounting that doesn’t break down by cache state is useless for diagnosis.
Duty 7: telemetry
Per turn: a parent span for the full turn (tokens, latency, cache state, cost); child spans for each model call, tool call, retry, and compaction trigger; attributes for model name, tool names, signal flags (error, cache_miss, budget_breach), session/tenant IDs; payloads behind a sampling flag with PII scrubbed. Langfuse, Arize Phoenix, LangSmith, and OpenTelemetry collectors with the GenAI semantic conventions all accept this shape. Telemetry isn’t bolted on at the end — it decides whether the previous six duties are debuggable. A harness with bad traces is untestable in production; a harness with good traces lets one engineer reconstruct a five-step failure path in two minutes.
Code: a minimum-viable harness in Python
The whole anatomy fits in one ~140-line example — pedagogical, not production-ready, but every duty appears somewhere. Install: pip install anthropic. Uses the Anthropic SDK.
| |
The interesting thing isn’t any individual line — it’s the coupling. The accounting in _account writes into the same Usage object the budget gate reads from. The trace in _dispatch feeds the same pipeline as the trace in run. The idempotency cache makes model.call safely retryable from a higher layer. None of the duties is independent; they share state in ways the model can’t see. That coupling is the harness. Splitting the duties across libraries that don’t share state is how home-grown harnesses end up with the symptoms in the opening paragraph.
Code: the same shape via the OpenAI Agents SDK in TypeScript
For contrast, the framework version. The OpenAI Agents SDK shipped a model-native harness with sandbox execution in April 2026; the TypeScript SDK is the production-friendly path for Node/Edge surfaces. Install: npm install @openai/agents zod.
| |
The SDK collapses the loop, the tool_result plumbing, parallel dispatch, and trace emission. What you still own: the calibration of maxTurns, the wall-clock deadline (wrap runner.run in Promise.race with setTimeout), the dollar ceiling (SDK reports usage, doesn’t enforce a cap), idempotency on mutating tools. The framework moves the seam; it doesn’t remove it.
The 2026 framework landscape
Four buckets, each with a different theory of where the harness boundary sits. Model-vendor harnesses — Claude Agent SDK and the OpenAI Agents SDK — power the vendors’ own products (Claude Code, Codex); opinionated, fast-moving, tightly coupled to one API surface; strict mode, prompt caching, parallel tool calls “just work” because SDK and API ship together. Portability is the trade. Graph harnesses — LangGraph (1.0 GA in 2025, 1.2 shipped May 11, 2026) and Google’s ADK — model the agent as a state graph with checkpointed transitions; strong on durable execution, weaker on cache hygiene because the graph engine is one abstraction away from cache_control markers. TypeScript-native — Mastra (@mastra/[email protected] on April 30, 2026), batteries-included with memory, RAG, tools, evals, and observability in one package for Node/Edge runtimes. Durable-execution — Temporal, Restate, Inngest — not agent frameworks but workflow engines you build the loop on top of, as the long-horizon reliability article detailed. Common pairings: LangGraph + Claude Agent SDK, Mastra + Inngest, LangGraph wrapping Temporal.
Build vs buy
The reflex is to pick a framework; the right decision is calibration-driven. Off-the-shelf harnesses come pre-calibrated for a particular workload shape — chat with light tool use, coding agents, research workflows. When your workload’s calibration assumptions match, you ship months faster. When they diverge, the framework’s defaults become a tax you pay on every turn, and its abstractions hide exactly the seams you need to optimize. Concrete assumptions to check: step cap defaults (Vercel AI SDK’s stepCountIs(20), OpenAI Agents SDK’s maxTurns, LangGraph’s recursion limit) against your actual task distribution; default tool-error handling (silent retry vs surface); cache breakpoint control (cache_control exposed or managed opaquely); state checkpointing semantics (between-node vs durable); telemetry shape (does the schema match your stack); provider coupling (one API or many, at what feature-parity cost).
If three or more of those assumptions are wrong for your workload, the framework will fight you on every iteration. If none are wrong, buy without guilt — Anthropic’s own building effective agents recommends starting with direct API calls precisely because most teams don’t need the framework’s abstractions yet. The build case is not “frameworks are bad” — it’s “the framework’s default calibration must approximately match your workload’s, or you’re paying for assumptions you don’t get to benefit from.” A reasonable progression: start with a 200-line custom harness on the vendor SDK; when the runtime hits a complexity wall — usually durable execution, multi-tenant cache discipline, or multi-agent coordination — adopt the framework whose calibration matches yours most cleanly. The pre-framework prototype gives you the vocabulary to read the docs and the empirical numbers to evaluate the defaults.
Trade-offs, failure modes, gotchas
The model thinks it’s running the show; the harness is. The most common architecture mistake is asking the model to enforce something the harness should enforce — “respect the budget”, “don’t loop”, “be careful with the database.” The model has no mechanism to enforce any of those across calls; it sees one turn at a time. The prompt should not contain the word “budget”.
Hidden state coupling between duties. The seven duties share state through the same in-memory objects. A “clean architecture” that splits accounting into one module, telemetry into another, dispatch into a third, with no shared state, will silently emit traces that don’t match the budget that doesn’t match the invoice. Either share state explicitly or stop claiming to have a coherent runtime.
Cache invalidation hidden in a refactor. Renaming a tool, reordering tools in the array, adding a comma to the system prompt — every textual change to the prefix invalidates the cache. The harness should assert the tokenized prefix hash in CI, with a deliberate update step when intentional.
Provider differences that leak past the harness. Anthropic’s tool_choice is {type: "auto" | "any" | "tool" | "none"}; OpenAI’s is "auto" | "required" | "none". Anthropic’s stop reason is "tool_use"; OpenAI’s is "tool_calls". A harness that abstracts over both ends up lossy (smallest-common surface) or leaky (the abstraction breaks on advanced features). Pick one provider as canonical and translate at the boundary.
Multi-tenancy collapses caches. A harness serving 10,000 users with per-user prompts is writing 10,000 distinct caches and reading from none of them. Use prompt_cache_key (OpenAI) or a stable prefix structure (Anthropic) to share the cacheable prefix across tenants; put tenant-specific content after the breakpoint. Forgetting this is how a “we turned on caching” report and a flat cost graph coexist.
Sandbox boundaries belong to the harness, not the tool. The OpenAI Agents SDK’s April 2026 sandbox and Claude Code’s permission gate live in the harness because the trust boundary is the harness’s job. A tool that sandboxes itself ships its own kernel — duplication and security risk in one move.
Eval the harness, not just the model. A model upgrade is a small change relative to a harness refactor; the second one is what breaks production. Run an eval suite end-to-end against frozen tasks and gate harness changes behind it the same way you gate prompt changes.
Further reading
- Phil Schmid — “The importance of Agent Harness in 2026” — the cleanest piece on the harness as the new competitive surface beyond models. CPU/RAM/OS/app framing, build-to-delete, the trajectory-data argument.
- Martin Fowler — “Harness engineering for coding agent users” — harness-as-context-engineering applied to coding agents, with the guides/sensors decomposition and the “Agent = Model + Harness” equation.
- Anthropic — “Building effective agents” — the right starting point on when not to reach for a framework. Simplicity-first defaults underpin every downstream harness decision.
- Inngest — “Your Agent Needs a Harness, Not a Framework” — the durable-execution case for treating the harness as workflow infrastructure.
What to read next
- Conversation Compaction: Keeping Long Sessions Alive — the deep dive on duty 1’s hardest sub-problem and duty 5’s worst failure mode. Reactive vs preemptive triggers, cache-aware surgical deletion, circuit breakers, snapshot-and-rollback, and append-only memory journals as the architectural alternative.
- The Agent Loop: ReAct and Its Descendants — the loop body the harness wraps. Today’s duties are the operational surface for the model/harness split.
- Long-Horizon Task Reliability — the recovery side of harness work at the extreme. Saga compensations, checkpointing, abort-vs-retry — all enforced by the harness, not the model.
- Prompt Caching: Reusing the KV Cache Across Calls — duty 4 deep dive. The cache discipline the harness owns and the lever with the biggest economic effect on a production agent.