Agent Budgets and Runaway Prevention

Step caps, deadlines, token and dollar ceilings, oscillation detection — the OS and distributed-systems primitives every agent harness ports.

Jatin Bansal@blog:~/ai-engineering$ open agent-budgets-and-runaway-prevention

In November 2025 a team running four LangChain agents watched an Analyzer and a Verifier fall into mutual recursion. The Analyzer produced an analysis; the Verifier asked for further analysis; the Analyzer obliged; eleven days later the bill was $47,000. The team had observability (the dashboard showed the spend climbing) but no enforcement. Alerts are asynchronous: by the time the page fires, every API call between the alert and the human reading it has already happened, and on a long weekend that gap is the entire damage window. A second account, from April 2026: a 35-engineer SaaS shop with an $87,000 bill for the month, of which one developer’s autonomous refactoring weekend burned $4,200. Neither incident involved a malicious agent. They were healthy agents running healthy loops, until they weren’t, and the only thing standing between healthy and catastrophic was a budget the harness either had or didn’t.

Opening bridge

Yesterday’s piece on PII detection and data privacy closed the second-to-last chapter of the Production & Operations defense-in-depth story: detection, transformation, residency for personal data. Guardrails framed the model boundary as a place where five attack classes converge; PII covered category four; today’s piece is the fifth and last layer in the same defense-in-depth stack — economic safety as a first-class boundary. Every other article in this subtree has assumed budgets exist somewhere. The agent loop article named a step cap as the floor of the abstraction and promised “a dedicated article later in the curriculum will go deeper on budgets and runaway prevention.” The long-horizon reliability article said “the agent-loop budgets still apply — they just fire later.” The agent harness anatomy article called cost accounting duty 6 and named the budget gate as the place where the harness enforces what the model can’t. Today we make all three of those concrete: the seven primitives every agent budget actually needs, the OS and distributed-systems heritage each one inherits, the cost math that tells you when each gate earns its complexity, and the runnable harness implementations.

Definition

An agent budget is the harness-enforced set of preconditions checked before every step that, when any one fails, terminates the run with persisted partial state — not after, not during, before the next side effect. Three properties distinguish a budget from observability or rate-limiting. It is enforced inside the request path, not by a sidecar or a cron job — the check happens synchronously between the loop’s iteration and the next provider call. It is the disjunction of multiple independent predicates, not a single number — step caps, wall-clock deadlines, token ceilings, dollar caps, tool quotas, oscillation detectors, and external abort signals all fire on the same conditional. And it terminates with persisted partial state, not silently — the aborted-with-partial-results outcome is a first-class success state the way the long-horizon-reliability article made explicit, not a failure to be retried.

The framing the rest of the article will return to: alerts are not enforcement. The $47K incident’s lesson is that the gap between the alert firing and the session stopping is exactly the window where damage compounds, and shrinking that gap to zero is the budget’s job. Every observability platform — Langfuse, LangSmith, Phoenix — surfaces token counts and dollar costs on dashboards; none of them, by default, stop the next call when the count crosses a threshold. The trace store is the forensic record; the budget gate is the kill-switch. The production tracing article covered the first; today’s piece covers the second.

The distributed-systems and OS parallels

The agent budget surface didn’t have to be invented. It is a port of decades of operating-system process control and distributed-systems reliability engineering, with the names changed.

The OS heritage. Every primitive on the budget list has a Unix ancestor. The step cap is ulimit -u: the maximum number of processes a user can spawn, the classic fork-bomb backstop that turns :(){ :|:& };: from a denial-of-service into an EAGAIN at the right boundary. The wall-clock deadline is SIGALRM from alarm(3): the kernel signals the process once N seconds of real time have elapsed. (setrlimit(RLIMIT_CPU) is the close cousin, but it counts CPU time rather than wall-clock time, so it never fires on a process that is blocked waiting.) The token budget is the memory ceiling: RLIMIT_AS and the cgroup memory limit, with the OOM killer as the enforcement mechanism when a cgroup grows past its limit. The dollar budget is the cgroup CPU quota and the scheduler’s bandwidth-control accounting: a per-tenant ceiling on a fungible resource, accounted online, enforced by throttling when the quota for the period runs dry. The per-tool quota is RLIMIT_NOFILE, the cap on file descriptors that prevents one process from monopolizing the I/O subsystem. The oscillation detector is the watchdog timer: a countdown that healthy code must keep resetting, and that forces a hard reset when the resets stop arriving. Every kernel needs an OOM killer; every agent needs an enforced budget. That is not a metaphor; it is the same primitive at a different layer.

The distributed-systems heritage. The other half of the surface is patterns from microservices reliability. Timeouts propagate end-to-end so that downstream work doesn’t outlive the deadline of the caller that asked for it; the same discipline that powers gRPC’s deadline propagation belongs on the agent loop. Rate limiters (token buckets and leaky buckets) sit on the per-tool quota: a tool can be called at most R times per second, with a burst capacity of B. Circuit breakers sit on the loop itself: count failures, trip after N, fall back; the same Hystrix-style discipline the tool-use article covered, but with the loop as the protected resource rather than a single downstream. Bulkheads (separate connection pools per downstream, so a slow tool can’t starve a fast one) port to per-tool concurrency caps inside the agent runtime. Retry caps with jitter prevent the retry-storm pathology that drives transient errors into permanent outages. Graceful degradation turns “we hit the limit” from an exception into a structured partial-result response. Every one of these patterns shows up in the agent budget surface because the agent loop is a distributed system whose endpoints happen to be model calls.
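The "at most R times per second, with a burst capacity of B" rule can be sketched as a token bucket. This is a minimal illustration rather than any framework's implementation: the class name `TokenBucket`, its parameters, and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts of up to `burst`.

    Hypothetical helper for illustration; not part of any agent framework.
    """

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate              # tokens refilled per second (R)
        self.capacity = float(burst)  # maximum stored tokens (B)
        self.tokens = float(burst)    # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if one is available; never blocks."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

One bucket per tool (or per tool class) is the usual arrangement; a `False` return is the point where the harness either waits or reports a quota error back to the model.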

The mapping isn’t ornamental. The mature agent-budget surface in 2026 reads like a port of man 2 setrlimit and the AWS Builders’ Library — the same problems with the same mitigations, and the same shape of incident report when the mitigations are missing.

The seven primitives

Every defensible agent budget is the disjunction of at least the following predicates. A harness that ships only a step cap has shipped 14% of the surface; the rest of this section walks the remaining 86%.

1. Step cap

The simplest predicate: iteration count exceeds N. Every framework ships this with a default: the Vercel AI SDK documents stepCountIs(20) as the default stop condition for its agent abstraction (a bare generateText call stops after a single step unless you pass stopWhen), the OpenAI Agents SDK’s maxTurns defaults to 10 and raises MaxTurnsExceededError when exceeded, and LangGraph’s recursion_limit defaults to 25 and raises GraphRecursionError on breach. Each is a wrapper over the same if step >= cap: abort check.

The trap is treating the default as load-bearing. The Vercel default of 20, OpenAI’s 10, LangGraph’s 25: these are calibration choices the framework author made for their median user, not for your workload. A research agent that needs 80 turns will hit MaxTurnsExceededError under any of those defaults long before completing a real task; a customer-service agent with a step cap of 80 will burn a quarter of an hour on the wrong path before the cap fires. Tune the step cap to your workload’s 95th-percentile completion turn count plus a safety margin. Run a sample of real tasks, observe the distribution, set the cap at p95 + 20%. The cap is the fork-bomb backstop, not the typical termination point; the typical termination is stop_reason != "tool_use" (the model itself stopping). If the cap is firing often, the cap is wrong or the model is stuck and you need finer-grained detection (section 6 below).
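The p95 + 20% rule is easy to compute from a sample of completed runs. The sketch below assumes a nearest-rank percentile and rounds the result up; the function name `step_cap_from_samples` and the `margin` parameter are made up for the example.

```python
import math


def step_cap_from_samples(turn_counts: list[int], margin: float = 0.20) -> int:
    """Return ceil(p95 * (1 + margin)) using the nearest-rank 95th percentile."""
    if not turn_counts:
        raise ValueError("need at least one observed run")
    ordered = sorted(turn_counts)
    # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
    # Integer arithmetic for ceil(0.95 * n) avoids floating-point edge cases.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return math.ceil(p95 * (1 + margin))
```

For a sample whose turn counts are 1 through 20, the nearest-rank p95 is 19 and the resulting cap is 23.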

2. Wall-clock deadline

The next predicate: time.monotonic() - started > deadline_s. Step cap is the count limit; deadline is the time limit, and they fire under different failure modes. A model that makes one slow tool call per step can reach step 5 and have burned 60 seconds; a model that makes a hundred fast calls can stay under a 60-second wall-clock but exceed a 50-step cap. Both deserve to terminate.

The discipline that matters: the deadline is per-run, not per-call. A 60-second per-call deadline on a 50-step run gives the agent 50 × 60 = 3000 seconds — fifty minutes — before either limit fires. The right shape is total_deadline = 60s; per_call_deadline = min(remaining_budget, max_per_call). The remaining budget shrinks monotonically; the per-call deadline tightens as the run ages, so the last step gets less wall-clock than the first. This is the same shape as gRPC’s deadline propagation: the upstream’s deadline is the ceiling on every downstream call, with attenuation for the time already consumed.
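A minimal sketch of that attenuation, assuming a single monotonic clock for the whole run. The `RunDeadline` class and its method names are hypothetical; the clock is injectable only so the behaviour can be checked without sleeping.

```python
import time
from typing import Callable


class RunDeadline:
    """Per-run deadline that hands out shrinking per-call timeouts."""

    def __init__(self, total_s: float, max_per_call_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.expires_at = clock() + total_s   # fixed once, at run start
        self.max_per_call_s = max_per_call_s

    def remaining(self) -> float:
        """Seconds left in the run budget, never negative."""
        return max(0.0, self.expires_at - self.clock())

    def per_call_timeout(self) -> float:
        """min(remaining_budget, max_per_call); raises once the run is out of time."""
        left = self.remaining()
        if left <= 0.0:
            raise TimeoutError("run deadline exceeded")
        return min(left, self.max_per_call_s)
```

The value returned by `per_call_timeout()` is what gets passed as the request timeout on each provider or tool call, so a call started late in the run cannot outlive the run itself.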

The wall-clock deadline interacts with the streaming surface from the streaming article. Cancellation must (a) close the upstream connection so the provider stops decoding (and billing), (b) abort in-flight tool executions whose results will never be used, (c) flush partial state to the conversation log. A deadline that fires but doesn’t propagate cancellation is the worst of both worlds — the run “stops” by Python’s metric, but the provider is still decoding the last response and the bill keeps climbing.
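Steps (a) and (c) can be sketched with asyncio cancellation; step (b) follows the same pattern applied to tool tasks. Everything here is a stand-in: `fake_provider_stream` simulates a streaming call with a sleep, and the `log` list plays the role of the conversation log. No real SDK is involved.

```python
import asyncio
from typing import Optional


async def fake_provider_stream(log: list) -> str:
    """Stand-in for a streaming provider call (hypothetical, no network)."""
    try:
        await asyncio.sleep(10)   # pretend the model is still decoding
        return "full response"
    finally:
        # Runs on normal exit and on cancellation: a real client would
        # close the HTTP stream here so the upstream request is dropped.
        log.append("upstream closed")


async def run_step(deadline_s: float, log: list) -> Optional[str]:
    try:
        # wait_for cancels the inner coroutine when the timeout elapses and
        # waits for its cleanup to finish before raising.
        return await asyncio.wait_for(fake_provider_stream(log),
                                      timeout=deadline_s)
    except asyncio.TimeoutError:
        log.append("partial state flushed")  # persist what the run has so far
        return None
```

The ordering is the point of the sketch: the upstream cleanup runs first, and only then does the caller record the partial state and return.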

3. Token ceiling

The third predicate: cumulative input + output tokens exceed max_tokens_total. Tokens are the proxy for both cost and context blow-up. The trap from the LeanOps incident analysis: a 4,000-token initial context doubling at each step reaches 128K at step 5 and 256K at step 6, past a 200K-token context window. A token ceiling that fires at 200K total caps the worst case at the cost of two-and-a-half full-context API calls: meaningful in dollar terms, decisive in fork-bomb terms.
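The doubling arithmetic can be checked directly. The sketch assumes step 0 is the 4,000-token initial context and that every step exactly doubles it, which is a simplification of real context growth; the function names are made up for the example.

```python
def context_at_step(initial_tokens: int, step: int) -> int:
    """Context size after `step` doublings of the initial context."""
    return initial_tokens * 2 ** step


def first_step_over(initial_tokens: int, limit: int) -> int:
    """First step whose single-call context size exceeds `limit`."""
    step = 0
    while context_at_step(initial_tokens, step) <= limit:
        step += 1
    return step


def first_step_cumulative_over(initial_tokens: int, ceiling: int) -> int:
    """First step at which the running total of input tokens exceeds `ceiling`."""
    total, step = 0, 0
    while True:
        total += context_at_step(initial_tokens, step)
        if total > ceiling:
            return step
        step += 1
```

Under these assumptions the per-call context is 128,000 tokens at step 5 and first exceeds 200,000 at step 6, while the running total crosses 200,000 during step 5. A cumulative ceiling checked before each call therefore stops the run before the step-6 request is sent.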

Token accounting must break down by cache state, per the prompt-caching article and the agent harness anatomy article. A harness that aggregates “tokens used” without separating cache_read_input_tokens from cache_creation_input_tokens will report a number that bears no relationship to the bill. The right shape:

text
tokens_used = input_tokens + output_tokens + cache_read + cache_write
dollars_used = (input × price_in + output × price_out
                + cache_read × price_cache_read + cache_write × price_cache_write) / 1e6

The dollar conversion needs per-provider, per-model, per-cache-tier pricing — and pricing changes quarterly. Pin the pricing-table version into the budget gate’s audit log (the same cost.pricing_version attribute the observability article recommended for spans) so historical comparisons stay coherent across rate changes.
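A worked instance of the formula, with the pricing version attached to the result. The prices are the Sonnet row of the PRICES table in this article's Python harness; the `PRICING_VERSION` tag and the `cost_record` helper are hypothetical names for the sketch.

```python
PRICING_VERSION = "2026-05"   # hypothetical tag; pin whatever your table uses

# Per-million-token prices, copied from the Sonnet row of the harness table.
PRICE = {"input": 3.0, "output": 15.0, "cache_read": 0.3, "cache_write": 3.75}


def cost_record(input_tokens: int, output_tokens: int,
                cache_read: int, cache_write: int) -> dict:
    """Price each token class separately and tag the result with the table version."""
    dollars = (input_tokens * PRICE["input"]
               + output_tokens * PRICE["output"]
               + cache_read * PRICE["cache_read"]
               + cache_write * PRICE["cache_write"]) / 1e6
    return {
        "tokens_used": input_tokens + output_tokens + cache_read + cache_write,
        "dollars_used": dollars,
        "cost.pricing_version": PRICING_VERSION,
    }


record = cost_record(input_tokens=10_000, output_tokens=2_000,
                     cache_read=50_000, cache_write=0)
```

With these inputs the call used 62,000 tokens and cost about $0.075. Pricing all 62,000 tokens at the flat input rate would give $0.186, which is the kind of gap an aggregate token count hides.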

4. Dollar ceiling

The fourth predicate: cumulative dollar cost exceeds max_dollars. Dollars and tokens diverge because the dollar/token ratio depends on cache hit rate, model tier, and the input/output split — see the cost-optimization and model-routing article for the full math. A token ceiling alone doesn’t catch the cost runaway that happens when a model-routing decision escalates the run from Haiku to Opus mid-loop; a dollar ceiling alone doesn’t catch the context-bloat runaway from a token explosion at constant cost-per-token. Run both.

The defensible production cut: per-run dollar ceiling + per-tenant daily dollar ceiling + per-tenant monthly dollar ceiling. The per-run cap stops the $47K incident at $50 instead of $47,000. The per-tenant daily cap stops the legitimate-but-runaway weekend pattern — one developer, $4,200, three days — at $500 instead of $4,200. The per-tenant monthly cap is the financial backstop: the line beyond which the platform team gets paged regardless of cause. These compose; the run aborts when any of them fires. The accounting state lives in Redis or a similar low-latency store keyed by tenant and time window; the budget gate reads the counter before every call.
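The three scopes compose into one check. In the sketch below an in-memory dictionary stands in for the Redis-style store, and the key layout, class names, and the monthly figure are illustrative assumptions (the $50 and $500 caps are the ones quoted above).

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class DollarCeilings:
    per_run: float = 50.0
    per_tenant_daily: float = 500.0
    per_tenant_monthly: float = 5_000.0   # illustrative, not from the article


@dataclass
class SpendLedger:
    """In-memory stand-in for a low-latency counter store (hypothetical)."""
    counters: dict = field(default_factory=dict)

    @staticmethod
    def keys(tenant: str, run_id: str, now: datetime) -> dict:
        # One counter per scope; the time window is part of the key.
        return {
            "per_run": f"spend:run:{run_id}",
            "per_tenant_daily": f"spend:{tenant}:{now:%Y-%m-%d}",
            "per_tenant_monthly": f"spend:{tenant}:{now:%Y-%m}",
        }

    def record(self, tenant: str, run_id: str, dollars: float,
               now: datetime) -> None:
        """Add the cost of one call to all three counters."""
        for key in self.keys(tenant, run_id, now).values():
            self.counters[key] = self.counters.get(key, 0.0) + dollars

    def breached(self, tenant: str, run_id: str, ceilings: DollarCeilings,
                 now: datetime) -> Optional[str]:
        """Return the first scope whose counter exceeds its ceiling, else None."""
        for scope, key in self.keys(tenant, run_id, now).items():
            if self.counters.get(key, 0.0) > getattr(ceilings, scope):
                return scope
        return None
```

The gate calls `breached(...)` before every provider call and `record(...)` after every response. With a real shared store the increment would need to be atomic across processes, which this single-process sketch does not address.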

5. Per-tool quota

The fifth predicate: calls to tool T this run exceed max_calls_per_tool[T]. A model that calls search_web 200 times in 10 steps is doing something wrong — either it’s stuck (no-progress detection should fire) or it’s interpreting the task wrong (the cap should fire as the backstop). Per-tool quotas are the agent equivalent of RLIMIT_NOFILE: a single tool monopolizing the runtime is itself a failure mode, and the per-tool budget is independent of the global step cap.

The right granularity is by tool class, not always by tool name. Mutating tools (charge_card, send_email, delete_record) get tight per-run caps measured in single digits. Read-only tools (search_web, read_file, query_db) get higher caps measured in dozens. The tool-selection-at-scale article covered the namespace discipline; per-tool quotas inherit from the same namespace tree: tools/mutating/* shares one bucket, tools/read/* shares another. Tool-class quotas are easier to keep consistent across deploys than per-tool caps that drift as the tool catalog grows.
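A sketch of class-level buckets keyed by namespace prefix. The tool paths, the caps, and the `ClassQuota` helper are hypothetical; the only idea taken from the text is that every tool under tools/mutating/ draws from one shared counter.

```python
from collections import Counter

# Hypothetical caps per namespace; unknown namespaces use DEFAULT_CAP.
CLASS_CAPS = {"tools/mutating": 5, "tools/read": 40}
DEFAULT_CAP = 60


def tool_class(tool_path: str) -> str:
    """Map 'tools/mutating/send_email' to its bucket 'tools/mutating'."""
    return tool_path.rsplit("/", 1)[0] if "/" in tool_path else "*"


class ClassQuota:
    def __init__(self) -> None:
        self.used: Counter = Counter()   # bucket name -> calls charged so far

    def try_charge(self, tool_path: str) -> bool:
        """Charge one call to the tool's class bucket; False when it is full."""
        bucket = tool_class(tool_path)
        cap = CLASS_CAPS.get(bucket, DEFAULT_CAP)
        if self.used[bucket] >= cap:
            return False
        self.used[bucket] += 1
        return True
```

Three charge_card calls plus two send_email calls exhaust the mutating bucket, so a following delete_record is refused even though that tool has never been called. Adding a new mutating tool to the catalog needs no quota change.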

6. No-progress detection

The sixth predicate: the last K tool calls were identical. The simplest no-progress detector hashes (tool_name, sorted_args) and looks for repeats in a sliding window — three identical calls in a row is decisive evidence the model is stuck, and no number of additional steps will help. The agent loop article showed the dumb-but-effective Python:

python
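# Signature = tool name plus canonical JSON of the arguments (keys sorted).
last_calls.append((tool_name, json.dumps(args, sort_keys=True)))
# Abort when the three most recent signatures are all the same call.
if last_calls[-3:].count(last_calls[-1]) >= 3:
    raise RuntimeError("no-progress: 3x repeat")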

That’s the floor. The ceiling is oscillation detection — the model alternating between two states without making progress, which the dumb detector misses because no single call repeats three times in a row. The Analyzer/Verifier pattern from the $47K incident is exactly this: A → V → A → V → A → V, where no individual call repeats but the pair does, indefinitely. The fix is to hash the call signature plus the previous call’s signature and look for repeated pairs:

python
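def is_oscillating(history, window=6):
    """True when the last `window` call signatures are one (a, b) pair repeated."""
    if len(history) < window:
        return False
    tail = history[-window:]
    # Pair up consecutive calls: (tail[0], tail[1]), (tail[2], tail[3]), ...
    pairs = list(zip(tail[0::2], tail[1::2]))
    return len(set(pairs)) == 1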

Three repeated pairs in six steps is oscillation; six strictly alternating calls is the Analyzer/Verifier shape exactly. Production no-progress detectors typically run both predicates (single-call repetition for the dumb-stuck case, pair repetition for the alternation case) plus a more sophisticated entropy-based detector for the long-horizon meltdown case that the long-horizon reliability article covered.
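Running both predicates together, and seeing which histories each one catches, looks roughly like this. The call signatures are made up, and `no_progress` is a hypothetical wrapper that reports which detector fired; the entropy-based detector is out of scope for the sketch.

```python
from typing import Optional


def is_streak(history: list, k: int = 3) -> bool:
    """True when the last k call signatures are identical."""
    return len(history) >= k and len(set(history[-k:])) == 1


def is_oscillating(history: list, window: int = 6) -> bool:
    """True when the last `window` signatures are one (a, b) pair repeated."""
    if window % 2 or len(history) < window:
        return False
    tail = history[-window:]
    return len(set(zip(tail[0::2], tail[1::2]))) == 1


def no_progress(history: list) -> Optional[str]:
    """Name the detector that fires, checking the cheaper streak test first."""
    if is_streak(history):
        return "streak"
    if is_oscillating(history):
        return "oscillation"
    return None


analyze, verify = ("analyze", "a1f3"), ("verify", "9c2e")  # made-up signatures
stuck = [analyze] * 3                         # same call three times
ping_pong = [analyze, verify] * 3             # A, V, A, V, A, V
healthy = [("search", str(i)) for i in range(6)]
```

The streak test alone passes the `ping_pong` history, because no call repeats back to back; the pair test is what catches it. Six identical calls also satisfy the pair test, which is why the streak check runs first and gets the credit.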

7. External abort signal

The seventh predicate: an external signal sets aborted = true. The other six predicates are reactive — they fire on state the loop owns. The external signal is for state the loop doesn’t own: an operator dashboard hitting a kill switch, an SLO alert firing on a different service, a budget breach on a different run by the same tenant. The Temporal example from the long-horizon reliability article is exactly this pattern — setHandler(abortSignal, () => { aborted = true; }) plus a while (!aborted) check in the loop body. The external signal is the operational pressure relief valve: when the budget gates are calibrated wrong and the run is climbing toward a dollar ceiling that’s higher than it should be, the operator’s kill switch is the last line of defense.

The signal must be checked at every iteration, not just at “natural” boundaries. A loop body that does five tool calls in parallel between checking the abort signal is five tool calls past the operator’s intent — the same shape as a thread that doesn’t check a cancellation token between synchronous operations. The discipline is to interleave the signal check with every step.
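One way to interleave the check, sketched for sequential dispatch: the abort flag is re-read before each tool call rather than once per loop iteration. `run_tool_batch` is a hypothetical helper, and truly parallel dispatch would need cancellation of in-flight calls on top of this.

```python
import threading
from typing import Callable


def run_tool_batch(calls: list[Callable[[], object]],
                   aborted: threading.Event) -> tuple[list[object], bool]:
    """Run tool calls one at a time, re-checking the abort flag before each.

    Returns (results_so_far, completed). Hypothetical helper for illustration.
    """
    results: list[object] = []
    for call in calls:
        if aborted.is_set():        # checked before every side effect
            return results, False
        results.append(call())
    return results, True
```

If the flag is set while the second of three calls is running, the third never starts and the caller gets back the two results that exist, which is the partial state to persist.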

The order of evaluation matters

The seven predicates are a disjunction — any one firing aborts the run — but the order in which they’re evaluated determines which one gets credited for the abort, and that matters for the audit log. The right order is cheap-and-decisive first:

  1. External abort signal (cheap, decisive)
  2. Step cap (cheap, decisive)
  3. Wall-clock deadline (cheap, decisive)
  4. Dollar ceiling (cheap once usage is tracked, decisive)
  5. Token ceiling (cheap, decisive — typically redundant with dollar but cheaper to compute)
  6. Per-tool quota (cheap, per-tool dictionary lookup)
  7. No-progress detection (cheapest is O(K) hash lookup over history)

Then a debounce: predicates fire before the next call, never after. The check happens at the top of the loop body, and crucially before the API request goes out — checking after the fact is how you accidentally double your spend right at the limit. The cost-accounting and budget-check code paths share state, so both read from the same usage object the API response updated last iteration. The harness anatomy article made the case for shared state across duties; budgets are the load-bearing example.

Cost math: when each gate earns its complexity

Step cap and wall-clock deadline cost nothing — counter increments and timer reads, sub-microsecond per turn. Token and dollar accounting cost the small overhead of summing a few integers and a multiplication. Per-tool quotas cost a dictionary lookup and an increment. No-progress detection costs an O(K) scan over the last K tool calls. External signal-checking costs a single atomic read. The total budget-gate overhead is well under a millisecond per iteration, against API call latencies of hundreds of milliseconds to several seconds. The cost of the budget is the cost of forgetting to ship one of the predicates.

The dollar math extends the cost-optimization article’s framing: a budget gate pays for itself when the runaway cost it prevents exceeds the engineering cost of building it. A no-progress detector takes an engineer a day; the $47K incident would have paid for that day many times over. A dollar ceiling enforced server-side takes a week; the $87K incident pays for the week. A full audit log of which gate fired when, with a one-page incident playbook attached, takes a month; the difference between catching the next runaway at $500 and at $50,000 pays for the month several times over. The budget surface is the single highest-leverage piece of harness work in the production stack, measured in dollars per engineer-hour spent.

Platform-level tooling (billing alerts at OpenAI, Anthropic’s quota and rate-limit settings and its quota-management dashboards) provides the financial backstop above the harness gate. It is not a substitute for the harness gate: the platform-level cap fires after the workload has been running for hours, while the harness-level gate fires before the next call. Both. Always both.

Code: a budgeted Python harness with the Anthropic SDK

A harness that ships all seven primitives in the smallest defensible code. The example uses the Anthropic SDK; the same shape ports to OpenAI by swapping the API client. Install: pip install anthropic.

python
import json, time, hashlib, threading
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Literal
from anthropic import Anthropic

client = Anthropic()

# Per-million pricing as of May 2026.
PRICES = {
    "claude-opus-4-7":   {"input": 5.0, "output": 25.0, "cache_read": 0.5, "cache_write": 6.25},
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_read": 0.3, "cache_write": 3.75},
    "claude-haiku-4-5":  {"input": 1.0, "output":  5.0, "cache_read": 0.1, "cache_write": 1.25},
}

@dataclass
class Budget:
    max_steps: int = 25
    max_seconds: float = 120.0
    max_tokens: int = 200_000
    max_dollars: float = 1.50
    # tool_class -> max calls per run; '*' is the global fallback.
    max_calls_per_tool: dict[str, int] = field(
        default_factory=lambda: {"mutating": 5, "read": 40, "*": 60}
    )
    # No-progress windows.
    no_progress_streak: int = 3  # K identical calls in a row
    oscillation_window: int = 6  # K-step alternation with a single pair

@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read: int = 0
    cache_write: int = 0
    dollars: float = 0.0

class BudgetBreach(Exception):
    """Raised by the gate. Carries which predicate fired for the audit log."""
    def __init__(self, predicate: str, detail: str):
        super().__init__(f"{predicate}: {detail}")
        self.predicate = predicate
        self.detail = detail

class BudgetedHarness:
    def __init__(
        self, model: str, system: str, tools: list[dict],
        tool_classes: dict[str, Literal["mutating", "read"]],
        dispatch: Callable[[str, dict], dict],
        budget: Budget,
    ):
        self.model = model
        self.system = [{"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}]
        self.tools = tools
        self.tool_classes = tool_classes  # tool_name -> "mutating" | "read"
        self.dispatch = dispatch
        self.budget = budget
        self.usage = Usage()
        # The external abort signal — thread-safe so a dashboard can flip it.
        self.aborted = threading.Event()
        # Mutable state the gate inspects.
        self.tool_calls: dict[str, int] = defaultdict(int)  # tool_name -> count
        self.call_history: list[tuple[str, str]] = []  # (tool_name, args_hash)

    # The gate. Cheap predicates first; raise on the first one that fires.
    def _check_budget(self, started: float, step: int) -> None:
        if self.aborted.is_set():
            raise BudgetBreach("external_abort", "kill switch")
        if step >= self.budget.max_steps:
            raise BudgetBreach("step_cap", f"step={step} >= {self.budget.max_steps}")
        elapsed = time.monotonic() - started
        if elapsed > self.budget.max_seconds:
            raise BudgetBreach("deadline", f"elapsed={elapsed:.1f}s > {self.budget.max_seconds}s")
        if self.usage.dollars > self.budget.max_dollars:
            raise BudgetBreach("dollar_ceiling",
                               f"${self.usage.dollars:.4f} > ${self.budget.max_dollars}")
        total_tokens = (self.usage.input_tokens + self.usage.output_tokens
                        + self.usage.cache_read + self.usage.cache_write)
        if total_tokens > self.budget.max_tokens:
            raise BudgetBreach("token_ceiling",
                               f"{total_tokens} > {self.budget.max_tokens}")
        # No-progress: streak of K identical calls.
        history = self.call_history
        if len(history) >= self.budget.no_progress_streak and \
           len(set(history[-self.budget.no_progress_streak:])) == 1:
            raise BudgetBreach("no_progress_streak",
                               f"{self.budget.no_progress_streak}x {history[-1][0]}")
        # Oscillation: K-step alternation between the same two calls.
        window = self.budget.oscillation_window
        if len(history) >= window:
            pairs = list(zip(history[-window::2], history[-window+1::2]))
            if len(set(pairs)) == 1:
                raise BudgetBreach("oscillation",
                                   f"alternation: {pairs[0][0][0]} <-> {pairs[0][1][0]}")

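    def _check_tool_quota(self, tool_name: str) -> None:
        """Separate gate, fires *before* a specific tool's dispatch."""
        cls = self.tool_classes.get(tool_name, "*")
        # Use dict.get with a default rather than `or`: a cap of 0 must mean
        # "never call this class", not "fall through to the global cap".
        cap = self.budget.max_calls_per_tool.get(
            cls, self.budget.max_calls_per_tool["*"])
        # Calls are counted per tool name and compared against the class cap.
        if self.tool_calls[tool_name] >= cap:
            raise BudgetBreach("tool_quota",
                               f"{tool_name} called {self.tool_calls[tool_name]}x >= {cap}")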

    def _account(self, u) -> None:
        self.usage.input_tokens += u.input_tokens
        self.usage.output_tokens += u.output_tokens
        cache_read = getattr(u, "cache_read_input_tokens", 0) or 0
        cache_write = getattr(u, "cache_creation_input_tokens", 0) or 0
        self.usage.cache_read += cache_read
        self.usage.cache_write += cache_write
        p = PRICES[self.model]
        self.usage.dollars += (
            u.input_tokens * p["input"] / 1e6
            + u.output_tokens * p["output"] / 1e6
            + cache_read * p["cache_read"] / 1e6
            + cache_write * p["cache_write"] / 1e6
        )

    def run(self, user_msg: str) -> dict:
        messages = [{"role": "user", "content": user_msg}]
        started = time.monotonic()
        step = 0
        breach: BudgetBreach | None = None
        try:
            while True:
                # Gate BEFORE the call, every iteration. This is the load-bearing line.
                self._check_budget(started, step)

                resp = client.messages.create(
                    model=self.model, max_tokens=2048,
                    system=self.system, tools=self.tools, messages=messages,
                )
                self._account(resp.usage)
                messages.append({"role": "assistant", "content": resp.content})

                if resp.stop_reason != "tool_use":
                    return self._exit("complete", step, messages, breach=None,
                                      final=resp.content)

                # Dispatch tools with per-tool quota gate.
                results = []
                for block in resp.content:
                    if block.type != "tool_use":
                        continue
                    self._check_tool_quota(block.name)
                    sig = (block.name,
                           hashlib.sha1(json.dumps(block.input, sort_keys=True)
                                        .encode()).hexdigest()[:12])
                    self.call_history.append(sig)
                    self.tool_calls[block.name] += 1
                    try:
                        out = self.dispatch(block.name, block.input)
                        results.append({"type": "tool_result", "tool_use_id": block.id,
                                        "content": json.dumps(out)})
                    except Exception as e:
                        results.append({"type": "tool_result", "tool_use_id": block.id,
                                        "is_error": True,
                                        "content": f"{type(e).__name__}: {e}"})
                messages.append({"role": "user", "content": results})
                step += 1
        except BudgetBreach as bb:
            breach = bb
            return self._exit("aborted", step, messages, breach=bb, final=None)

    def _exit(self, status: str, step: int, messages: list,
              breach: BudgetBreach | None, final) -> dict:
        return {
            "status": status,                       # "complete" | "aborted"
            "breach": breach.predicate if breach else None,
            "detail": breach.detail if breach else None,
            "step": step,
            "usage": self.usage,
            "tool_calls": dict(self.tool_calls),
            "messages": messages,                   # persisted partial state
            "final": final,
        }

Four properties of the shape are worth internalizing. The gate is checked before the API call, every iteration — checking after is how you accidentally overspend right at the limit. BudgetBreach carries the predicate name so the audit log records which budget fired, not just “the run aborted.” The per-tool quota is a separate gate with its own check point, fired right before that tool’s dispatch — letting a submit_final_answer tool through after the global step cap fires is exactly the bug pattern you don’t want. The aborted-with-partial-state path returns the same shape as success — the caller distinguishes by status, gets the same messages, usage, and tool_calls regardless. The aborted run is a first-class outcome, not a failure to retry.

The threading.Event for aborted is the load-bearing piece for the external signal — a dashboard endpoint, an SLO alert handler, or a higher-level orchestrator can call harness.aborted.set() and the next iteration’s gate fires. In production the same pattern composes with Temporal’s signalWithStart when the harness runs as a durable workflow, but the in-process Event is the floor.

Code: a TypeScript budgeted harness with the Vercel AI SDK

The TypeScript story is shaped by the Vercel AI SDK’s stopWhen API — a composable disjunction of predicates the SDK evaluates after each step. Install: npm install ai @ai-sdk/anthropic zod.

typescript
import { anthropic } from "@ai-sdk/anthropic";
import {
  generateText, tool, stepCountIs, hasToolCall,
  type ToolSet,
} from "ai";
import { z } from "zod";
import crypto from "node:crypto";

interface Budget {
  maxSteps: number;
  maxSeconds: number;
  maxTokens: number;
  maxDollars: number;
  maxCallsPerToolClass: Record<"mutating" | "read" | "*", number>;
  noProgressStreak: number;
  oscillationWindow: number;
}

const DEFAULT_BUDGET: Budget = {
  maxSteps: 25,
  maxSeconds: 120,
  maxTokens: 200_000,
  maxDollars: 1.5,
  maxCallsPerToolClass: { mutating: 5, read: 40, "*": 60 },
  noProgressStreak: 3,
  oscillationWindow: 6,
};

// Per-million pricing for Claude Opus 4.7 as of 2026-05.
const PRICE = { in: 5.0, out: 25.0, cacheRead: 0.5, cacheWrite: 6.25 };

interface BudgetState {
  started: number;
  tokensTotal: number;
  dollars: number;
  toolCalls: Record<string, number>;
  callHistory: Array<{ name: string; argsHash: string }>;
  abortRequested: { value: boolean };
  toolClasses: Record<string, "mutating" | "read">;
  budget: Budget;
}

function hash(input: unknown): string {
  return crypto.createHash("sha1").update(JSON.stringify(input)).digest("hex").slice(0, 12);
}

function checkOscillation(history: BudgetState["callHistory"], window: number): boolean {
  if (history.length < window) return false;
  const slice = history.slice(-window);
  const pairs = new Set<string>();
  for (let i = 0; i < slice.length - 1; i += 2) {
    pairs.add(`${slice[i].name}:${slice[i].argsHash}|${slice[i+1].name}:${slice[i+1].argsHash}`);
  }
  return pairs.size === 1;
}

// The composed stop predicate. Returns true to halt the loop.
function makeBudgetGate(state: BudgetState) {
  return ({ steps }: { steps: unknown[] }) => {
    if (state.abortRequested.value) return true;
    const elapsed = (Date.now() - state.started) / 1000;
    if (elapsed > state.budget.maxSeconds) return true;
    if (state.dollars > state.budget.maxDollars) return true;
    if (state.tokensTotal > state.budget.maxTokens) return true;
    // No-progress streak.
    const h = state.callHistory;
    if (h.length >= state.budget.noProgressStreak) {
      const tail = h.slice(-state.budget.noProgressStreak);
      if (new Set(tail.map(c => `${c.name}:${c.argsHash}`)).size === 1) return true;
    }
    if (checkOscillation(h, state.budget.oscillationWindow)) return true;
    return false;
  };
}

// Per-tool quota enforcement happens inside the tool wrapper — by the time
// stopWhen sees the step, the tool would already have executed.
function wrapToolWithQuota<TArgs>(
  name: string, originalExecute: (args: TArgs) => Promise<unknown>,
  state: BudgetState,
) {
  return async (args: TArgs) => {
    const cls = state.toolClasses[name] ?? "*";
    const cap = state.budget.maxCallsPerToolClass[cls]
              ?? state.budget.maxCallsPerToolClass["*"];
    const current = state.toolCalls[name] ?? 0;
    if (current >= cap) {
      // Surface as a tool-result error so the model can recover or stop.
      return { error: `tool_quota_exceeded`, tool: name, calls: current, cap };
    }
    state.toolCalls[name] = current + 1;
    state.callHistory.push({ name, argsHash: hash(args) });
    return originalExecute(args);
  };
}

// Example tools.
const searchWeb = tool({
  description: "Search the web. Read-only.",
  inputSchema: z.object({ query: z.string(), k: z.number().int().default(5) }),
  execute: async ({ query, k }) =>
    [{ title: "Example", url: "https://example.com/x", snippet: "..." }],
});

const submitFinal = tool({
  description: "Submit the final answer. Terminates the loop.",
  inputSchema: z.object({ answer: z.string() }),
  execute: async ({ answer }) => ({ answer }),
});

export async function runWithBudget(
  goal: string,
  budget: Budget = DEFAULT_BUDGET,
  abortRequested: { value: boolean } = { value: false },
) {
  const state: BudgetState = {
    started: Date.now(),
    tokensTotal: 0,
    dollars: 0,
    toolCalls: {},
    callHistory: [],
    abortRequested,
    toolClasses: { searchWeb: "read", submitFinal: "read" },
    budget,
  };

  // Wrap each tool with the quota gate.
  const wrappedTools: ToolSet = {
    searchWeb: { ...searchWeb,
                 execute: wrapToolWithQuota("searchWeb", searchWeb.execute!, state) },
    submitFinal: { ...submitFinal,
                   execute: wrapToolWithQuota("submitFinal", submitFinal.execute!, state) },
  };

  const result = await generateText({
    model: anthropic("claude-opus-4-7"),
    tools: wrappedTools,
    prompt: goal,
    // Compose: any predicate firing halts. Step cap and hasToolCall are
    // SDK-provided; the inline gate carries the other five predicates.
    stopWhen: [
      stepCountIs(budget.maxSteps),
      hasToolCall("submitFinal"),
      makeBudgetGate(state),
    ],
    // Track usage per step so the gate's dollar/token state stays current.
    onStepFinish: ({ usage, providerMetadata }) => {
      const input = usage?.inputTokens ?? 0;
      const output = usage?.outputTokens ?? 0;
      const cacheRead =
        (providerMetadata?.anthropic?.cacheReadInputTokens as number) ?? 0;
      const cacheWrite =
        (providerMetadata?.anthropic?.cacheCreationInputTokens as number) ?? 0;
      state.tokensTotal += input + output + cacheRead + cacheWrite;
      state.dollars += (input * PRICE.in + output * PRICE.out
                       + cacheRead * PRICE.cacheRead + cacheWrite * PRICE.cacheWrite) / 1e6;
    },
  });

  // Re-run the gate so no-progress and oscillation halts are reported as
  // aborted too, not only abort-flag, deadline, dollar, and token breaches.
  const aborted = makeBudgetGate(state)({ steps: result.steps });

  return {
    status: aborted ? "aborted" : "complete",
    text: result.text,
    steps: result.steps.length,
    usage: { tokens: state.tokensTotal, dollars: state.dollars },
    toolCalls: state.toolCalls,
  };
}

The shape is deliberately close to the Python version — the predicates, the order of evaluation, the per-tool wrapper, the partial-state return. The framework-specific seam is the stopWhen composition: where Python’s harness runs the gate inline, the SDK runs it between steps via stopWhen. The per-tool quota has to live inside the tool wrapper because by the time stopWhen sees the step, the tool would already have executed; surfacing a tool_quota_exceeded error through the tool result is the cleanest way to let the model either pick a different tool or terminate gracefully. The abortRequested ref-cell is the external-signal hook: any external caller can flip abortRequested.value = true and the next stopWhen evaluation halts the loop.

The trade-off the SDK shape buys: the stopWhen API checks predicates after each step rather than before, so the very last call before a breach still executes. The corresponding mitigation is the per-step budget conservatism — the dollar ceiling at 95% of the platform-level cap, so a single overshoot doesn’t cross the hard line. Production hybrids run a hard pre-call gate inside onStepFinish (set a flag, raise on next entry) plus stopWhen as the SDK-friendly halt mechanism.
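A minimal sketch of the "set a flag, raise on next entry" half of that hybrid. The `PreCallGate` class and `BudgetBreach` error are hypothetical names, not AI SDK APIs; the assumption is that the harness calls `recordStep` from its post-step hook and `enter` immediately before each model call.

```typescript
// Hypothetical names throughout; nothing here is part of the AI SDK.
class BudgetBreach extends Error {
  reason: string;
  constructor(reason: string) {
    super(`budget breach: ${reason}`);
    this.name = "BudgetBreach";
    this.reason = reason;
  }
}

class PreCallGate {
  private dollars = 0;
  private breached: string | null = null;
  private readonly maxDollars: number;

  constructor(maxDollars: number) {
    this.maxDollars = maxDollars;
  }

  // Call from the post-step hook (e.g. onStepFinish) with the step's cost.
  // Only sets the flag; it never throws from inside the hook.
  recordStep(stepDollars: number): void {
    this.dollars += stepDollars;
    if (this.dollars > this.maxDollars) this.breached = "max_dollars";
  }

  // Call immediately before every model call. Throws if an earlier step
  // already crossed the ceiling, so no further provider call is issued.
  enter(): void {
    if (this.breached !== null) throw new BudgetBreach(this.breached);
  }

  get spent(): number {
    return this.dollars;
  }
}
```

The step that crosses the ceiling has still been paid for; the gate only guarantees that nothing runs after it, which is why the conservative ceiling is still needed.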

Trade-offs, failure modes, gotchas

The model thinks it’s running the show; it isn’t. Asking the model to “respect the budget” or “stop when you’ve called the same tool five times” doesn’t enforce a budget any more than asking a process to enforce its own scheduling quantum. The model has no mechanism to add up tokens across calls; it sees one turn at a time. The prompt should not contain the word “budget.” This is the same load-bearing claim the agent harness anatomy article made for the kernel/userspace split: the model is the policy; the harness is the kernel; the budget is the kernel’s scheduler.

Silent retries multiply the cost. Every transient-error retry in dispatch (the network blip, the 503, the rate limit) pays a full provider call. A retry policy with exponential backoff and a cap of 3 attempts means the worst-case cost in a budget-breach scenario is 3× the bill, not 1×. The mitigation is to charge retries against the budget: every retry is charged to the dollar and token counters the same way the original call was, so the gate fires on the cumulative cost, not on the nominal step count. The tool-use article’s circuit-breaker treatment ports here: trip the breaker on the tool after N failures, fall back to a structured tool_result error, let the model recover or abort, but don’t keep retrying silently.
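One way to charge retries against the budget, sketched under assumptions: `Spend`, `chargeAttempt`, and `backoffMs` are hypothetical helpers, and the per-attempt cost is an estimate supplied by the caller, since a failed attempt often returns no usage data. The backoff uses full jitter.

```typescript
// Hypothetical helpers; the Spend object stands in for the run's budget state.
interface Spend { dollars: number; maxDollars: number }

// Full-jitter exponential backoff: uniform in [0, min(capMs, baseMs * 2^retry)).
function backoffMs(retry: number, baseMs = 250, capMs = 8000,
                   random: () => number = Math.random): number {
  return Math.floor(random() * Math.min(capMs, baseMs * Math.pow(2, retry)));
}

type AttemptDecision =
  | { allowed: true; delayMs: number }
  | { allowed: false; reason: "max_attempts" | "max_dollars" };

// Charge the estimated cost BEFORE dispatching the attempt, so a retry draws
// down the same counter as the original call. `attempt` is 0 for the first try.
function chargeAttempt(spend: Spend, attempt: number, estDollars: number,
                       maxAttempts = 3,
                       random: () => number = Math.random): AttemptDecision {
  if (attempt >= maxAttempts) return { allowed: false, reason: "max_attempts" };
  if (spend.dollars + estDollars > spend.maxDollars) {
    return { allowed: false, reason: "max_dollars" };
  }
  spend.dollars += estDollars;
  const delayMs = attempt === 0 ? 0 : backoffMs(attempt - 1, 250, 8000, random);
  return { allowed: true, delayMs };
}
```

A refusal is meant to surface as a structured tool-result error rather than another silent attempt; reconciling the estimate against real usage once the call returns is left out of the sketch.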

Doubled side effects on a budget breach. When the budget fires during a step that has already issued a tool_use, the harness has a choice: roll back the tool’s side effect (often impossible — emails sent are not unsent), tag the tool result as orphaned in the conversation log (the model never sees it; the cache is dirty), or persist it and re-enter at recovery time (the long-horizon reliability article’s saga-compensation discipline is the right shape). The honest answer is idempotency keys on every mutating tool, plus a partial-state record that the recovery process consults to avoid double-dispatch. A budget gate that fires mid-step without a partial-state record can leave the system in an inconsistent state worse than a slightly higher bill.
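A small sketch of the idempotency-key plus partial-state idea, assuming a hypothetical in-memory `DispatchLedger` (production would need a durable store) and synchronous effects for brevity. The key is derived from values that are stable across a resume, so a recovered run maps the same step to the same record.

```typescript
// Hypothetical ledger; an in-memory Map stands in for durable storage.
type Outcome<T> =
  | { status: "executed" | "replayed"; result: T }
  | { status: "unresolved" };

function idempotencyKey(runId: string, step: number, tool: string, argsHash: string): string {
  return `${runId}:${step}:${tool}:${argsHash}`;
}

class DispatchLedger {
  private records = new Map<string, { done: boolean; result?: unknown }>();

  // Runs `effect` at most once per key. A later call with the same key
  // replays the recorded result instead of repeating the side effect.
  dispatch<T>(key: string, effect: () => T): Outcome<T> {
    const prior = this.records.get(key);
    if (prior && prior.done) return { status: "replayed", result: prior.result as T };
    // Started but never finished: the side effect may or may not have landed,
    // so recovery has to reconcile instead of blindly re-dispatching.
    if (prior) return { status: "unresolved" };
    this.records.set(key, { done: false });
    const result = effect();
    this.records.set(key, { done: true, result });
    return { status: "executed", result };
  }
}
```

The "unresolved" state is the partial-state record doing its job: it tells the recovery process that a mutating call was in flight when the run stopped.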

Watchdog-killed-the-summarizer. A wall-clock deadline that fires during a compaction pass leaves the session in the same wedged state the conversation-compaction article’s opening anecdote described — the buffer is too large to call the foreground model, and the compaction that would shrink it just died. The discipline: compaction operations are exempt from the run-level deadline, with their own (shorter) deadline and their own circuit breaker. The BudgetBreach for deadline should never fire while the compactor is running; the compactor’s own breaker fires on its own timeout. Conflating the two is how the 3am page from the conversation-compaction article happens.
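The exemption can be sketched as a deadline that stops ticking while the compactor runs. `RunDeadline` is a hypothetical class and the injected clock is an assumption made for testability; a real harness would wire `compactionExpired` to the compactor's own breaker.

```typescript
// Hypothetical RunDeadline: time spent compacting does not count against the run.
class RunDeadline {
  private readonly startedAt: number;
  private readonly maxMs: number;
  private readonly compactionMaxMs: number;
  private readonly now: () => number;
  private exemptMs = 0;
  private compactionStart: number | null = null;

  constructor(maxMs: number, compactionMaxMs: number, now: () => number = Date.now) {
    this.maxMs = maxMs;
    this.compactionMaxMs = compactionMaxMs;
    this.now = now;
    this.startedAt = now();
  }

  beginCompaction(): void {
    if (this.compactionStart === null) this.compactionStart = this.now();
  }

  endCompaction(): void {
    if (this.compactionStart === null) return;
    this.exemptMs += this.now() - this.compactionStart;
    this.compactionStart = null;
  }

  // Run-level deadline: never fires while the compactor is running.
  runExpired(): boolean {
    if (this.compactionStart !== null) return false;
    return this.now() - this.startedAt - this.exemptMs > this.maxMs;
  }

  // The compactor's own, shorter deadline.
  compactionExpired(): boolean {
    return this.compactionStart !== null
      && this.now() - this.compactionStart > this.compactionMaxMs;
  }
}
```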

The bulkhead that wasn’t: multi-tenant cost leakage. A per-run dollar ceiling stops this run at $1.50 but does nothing about the same tenant opening a new run a millisecond after the breach. The fix is the per-tenant bucket — a daily cap on cumulative spend across all runs by the same tenant, enforced by a centralized counter (Redis, Postgres, a dedicated ledger service). Without it, a runaway tenant can spawn N runs in parallel and incur N × max_dollars before any gate fires. The bulkhead pattern from microservices reliability ports directly: per-tenant pools mean a runaway tenant can’t starve fair-use tenants of capacity, and per-tenant budgets mean a runaway tenant can’t exceed their economic ceiling.
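A sketch of the per-tenant bucket, with a hypothetical in-memory `TenantLedger` standing in for the centralized counter. The design choice it illustrates is an assumption on my part: reserve each run's full ceiling before the run starts, so parallel runs cannot each spend up to the per-run cap before any of them reports usage. In a shared store the check-and-increment would have to be atomic.

```typescript
// Hypothetical in-memory stand-in for Redis, Postgres, or a ledger service.
class TenantLedger {
  private committed = new Map<string, number>(); // key: "<tenant>:<UTC day>"
  private readonly dailyCapDollars: number;

  constructor(dailyCapDollars: number) {
    this.dailyCapDollars = dailyCapDollars;
  }

  private static bucket(tenant: string, at: Date): string {
    return `${tenant}:${at.toISOString().slice(0, 10)}`;
  }

  // Reserve a run's whole dollar ceiling up front; refuse the run if the
  // tenant's daily cap would be exceeded. Single-threaded here; a real store
  // needs an atomic compare-and-increment.
  reserve(tenant: string, runMaxDollars: number, at: Date = new Date()): boolean {
    const key = TenantLedger.bucket(tenant, at);
    const current = this.committed.get(key) ?? 0;
    if (current + runMaxDollars > this.dailyCapDollars) return false;
    this.committed.set(key, current + runMaxDollars);
    return true;
  }

  // Hand back the unspent part of a reservation when the run ends.
  release(tenant: string, unusedDollars: number, at: Date = new Date()): void {
    const key = TenantLedger.bucket(tenant, at);
    const current = this.committed.get(key) ?? 0;
    this.committed.set(key, Math.max(0, current - unusedDollars));
  }
}
```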

The model can bypass the cap by emitting a single 200K-token output. A max_tokens limit per call set too high lets the model emit a single huge response that blows the token budget and the dollar budget in one shot. The mitigation: cap max_tokens per call at a value materially lower than the per-run token budget (e.g., 2048 for chat-shaped agents, 8192 for synthesis tasks). This is the same shape as the “huge tool result” failure from the compaction article — a 50K-token tool result at 90K of a 100K budget blows the gate’s projection. The defensive cut is to apply the gate’s projection logic to the expected output (a conservative bound on max_tokens) before the call, not just to the actual usage after.
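The pre-call projection can be sketched as two pure functions. The prices mirror the input and output entries of the PRICE table in the listing above; `projectCall` and `clampMaxTokens` are hypothetical names, and cache pricing is ignored for brevity.

```typescript
// Dollars per million tokens, copied from the PRICE table (input/output only).
const PRICE_PER_MTOK = { in: 5.0, out: 25.0 };

interface Remaining { tokens: number; dollars: number }

// Worst case for the next call: the model uses its entire max_tokens allowance.
function projectCall(inputTokens: number, maxOutputTokens: number) {
  return {
    tokens: inputTokens + maxOutputTokens,
    dollars: (inputTokens * PRICE_PER_MTOK.in
              + maxOutputTokens * PRICE_PER_MTOK.out) / 1e6,
  };
}

// Largest max_tokens that fits both remaining budgets, clamped to a per-call
// cap. A result of 0 means the call should not be made at all.
function clampMaxTokens(inputTokens: number, remaining: Remaining,
                        perCallCap: number): number {
  const byTokens = remaining.tokens - inputTokens;
  const byDollars = Math.floor(
    (remaining.dollars * 1e6 - inputTokens * PRICE_PER_MTOK.in) / PRICE_PER_MTOK.out,
  );
  return Math.max(0, Math.min(perCallCap, byTokens, byDollars));
}
```

Passing the clamped value as the call's max_tokens turns the per-run ceiling into a bound the provider enforces on the single response, instead of something the gate discovers afterwards.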

Recursion through delegation. When the agent calls a sub-agent (a multi-agent orchestration pattern), the sub-agent has its own budget — which means the parent’s budget can be silently multiplied by N if the parent issues N sub-agent calls. The fix is budget inheritance with attenuation: the sub-agent’s max_dollars is the remaining parent budget at the moment of delegation, not the parent’s original budget. The orchestrator article’s hierarchical-coordinator pattern carries this discipline by default; flat multi-agent designs that pass the budget by copy rather than by reference are how the LangChain Analyzer/Verifier $47K incident compounded.
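Inheritance with attenuation, as a sketch: the hypothetical `DollarBudget` below hands a sub-agent whatever the parent has left at the moment of delegation, and every charge propagates up the chain so siblings draw from one shared pool rather than N independent copies.

```typescript
// Hypothetical DollarBudget: budgets are passed by reference, not by copy.
class DollarBudget {
  private spent = 0;
  private readonly max: number;
  private readonly parent: DollarBudget | null;

  constructor(max: number, parent: DollarBudget | null = null) {
    this.max = max;
    this.parent = parent;
  }

  // What this budget can still spend, never more than any ancestor has left.
  get remaining(): number {
    const own = this.max - this.spent;
    return this.parent ? Math.min(own, this.parent.remaining) : own;
  }

  // Child ceiling is the parent's remaining budget right now, optionally
  // attenuated further by a fraction in (0, 1].
  delegate(fraction = 1): DollarBudget {
    return new DollarBudget(this.remaining * fraction, this);
  }

  // Charge this budget and every ancestor; refuse if it does not fit.
  charge(dollars: number): boolean {
    if (dollars > this.remaining) return false;
    for (let b: DollarBudget | null = this; b !== null; b = b.parent) {
      b.spent += dollars;
    }
    return true;
  }
}
```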

The budget breach that left state in limbo. A run that aborts mid-loop without persisting what it was doing is a run the recovery process can’t pick up. The discipline is to make every BudgetBreach exit do three things in order: (a) flush the conversation log to durable storage with a sequence number, (b) emit a structured incident record with the breach predicate, the usage at the time, and the next planned step, (c) return the partial result envelope to the caller. The recovery process — manual or automated — reads the incident record and decides whether to resume from a compacted prefix (the long-horizon reliability MOP-restart pattern) or to mark the task complete-with-partial-results.
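The three-step exit, sketched with a hypothetical `DurableStore` interface and synchronous writes for brevity. The field names are assumptions; the point is the order (log, then incident record, then envelope) and that the envelope carries the sequence number recovery should resume from.

```typescript
// Hypothetical shapes; a real store would be durable and asynchronous.
interface Usage { tokens: number; dollars: number }
interface IncidentRecord {
  runId: string; seq: number; predicate: string;
  usage: Usage; nextPlannedStep: string | null;
}
interface DurableStore {
  appendLog(runId: string, seq: number, messages: unknown[]): void;
  writeIncident(record: IncidentRecord): void;
}
interface RunSnapshot {
  runId: string; lastSeq: number; messages: unknown[];
  usage: Usage; nextPlannedStep: string | null; partialText: string;
}

function exitOnBreach(store: DurableStore, run: RunSnapshot, predicate: string) {
  const seq = run.lastSeq + 1;
  // (a) Flush the conversation log with a sequence number.
  store.appendLog(run.runId, seq, run.messages);
  // (b) Emit the structured incident record.
  store.writeIncident({
    runId: run.runId, seq, predicate,
    usage: run.usage, nextPlannedStep: run.nextPlannedStep,
  });
  // (c) Return the partial-result envelope to the caller.
  return {
    status: "aborted" as const, predicate, resumeFromSeq: seq,
    partialText: run.partialText, usage: run.usage,
  };
}
```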

Off-by-one on the cap. A step cap of N that includes the planning step versus excludes it, a deadline of T seconds measured from request entry versus from first model call, a token ceiling that includes the system prompt versus excludes it — the off-by-one bugs in budget gates are insidious because the test that catches them looks like a unit test that passes. The defensible pattern is property-based testing against the gate: assert that no run ever exceeds N × max_per_call_tokens regardless of model behavior, assert that no run’s wall-clock exceeds the deadline + the slop of the last call’s latency. The gate is one of the few harness components where formal invariants are worth writing down.
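A hand-rolled property test along those lines, avoiding any test-library dependency. Everything here is an assumption for illustration: `simulateRun` is a toy loop with a post-step gate (the same shape as stopWhen), the "model" is an adversary that emits any call size up to the per-call cap, and the seeded generator keeps failures reproducible.

```typescript
// Small seeded linear congruential generator, so a failing seed can be replayed.
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

// Toy loop: the gate is checked AFTER each step, so the last call before a
// breach still executes.
function simulateRun(maxSteps: number, maxTokens: number, maxPerCall: number,
                     rand: () => number): { steps: number; tokens: number } {
  let steps = 0;
  let tokens = 0;
  while (true) {
    tokens += 1 + Math.floor(rand() * maxPerCall); // call size in [1, maxPerCall]
    steps += 1;
    if (steps >= maxSteps || tokens > maxTokens) break;
  }
  return { steps, tokens };
}

// Returns null if every trial satisfies the invariants, else a description.
function checkInvariants(trials: number, seed: number): string | null {
  const rand = lcg(seed);
  for (let i = 0; i < trials; i++) {
    const maxSteps = 1 + Math.floor(rand() * 50);
    const maxPerCall = 1 + Math.floor(rand() * 8192);
    const maxTokens = 1 + Math.floor(rand() * 200000);
    const r = simulateRun(maxSteps, maxTokens, maxPerCall, rand);
    if (r.steps > maxSteps) return `trial ${i}: step cap exceeded`;
    if (r.tokens > maxSteps * maxPerCall) return `trial ${i}: exceeded N x max_per_call`;
    if (r.tokens > maxTokens + maxPerCall) return `trial ${i}: overshoot larger than one call`;
  }
  return null;
}
```

The third invariant is the token analogue of "deadline plus the slop of the last call": a post-step gate may overshoot, but never by more than one call's worth.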

Alerts are not enforcement. The closing claim, restated for emphasis. A dashboard that shows the spend climbing past $1,000 is observability; a gate that returns from the loop at $1.50 is enforcement. The observability platform you chose — Langfuse, LangSmith, Phoenix, Datadog — is not the budget. It is the audit log of what happened under the budget’s enforcement. The budget itself is code, deterministic, in the request path, and the gap between the alert firing and the session stopping is exactly the gap between $50 and $47,000.

Further reading from the field

  • Simon Willison — Agents tag — the running index of Willison’s posts on agent failure modes; his consistent point is that classifier-based defenses lose to adaptive attackers and that runtime enforcement at the harness is the only durable mitigation. The same logic underlies the budget gate: classifiers (rate limits, fraud-detection-style anomaly scoring) catch the median runaway, but only the synchronous in-loop gate catches the next runaway.
  • AWS Builders’ Library — Timeouts, retries, and backoff with jitter — the canonical reference for the retry-storm pathology and the jittered backoff mitigation. The discipline ports unchanged to agent loops: every retry is a full provider call, every retry counts against the budget, and the cap-with-jitter is the difference between a transient blip and a $47K bill.
  • Anthropic — Building effective agents — the December 2024 engineering writeup that distinguishes workflows from agents and recommends simplicity-first defaults. The budget gate is the simplicity-first answer to “how do we know the loop is bounded?”
  • Phil Schmid — The importance of Agent Harness in 2026 — the harness-as-kernel framing that organizes this article and the agent harness anatomy piece. The budget is the kernel’s scheduler; the model is the process; the process doesn’t get to choose its own quantum.
  • Anatomy of an Agent Harness — the runtime layer the budget gate lives inside. Duty 6 (cost accounting) is the immediate parent of today’s piece; the other six duties supply the shared state the gate inspects.
  • The Agent Loop: ReAct and Its Descendants — the loop body the budget gate wraps. The stopping-condition disjunction in the agent-loop article is the budget gate at a lower resolution; today’s piece is the full version.
  • Long-Horizon Task Reliability — the saga-compensation discipline for what to do after the budget gate fires mid-run. Partial state, idempotency keys, and the abort-vs-retry decision are the recovery primitives that complement today’s enforcement primitives.
  • Production Tracing and Observability for LLM Systems — the audit log that records which gate fired when. Observability is not enforcement, but enforcement without observability is a black box: the trace store is where the incident playbook is written from.