$ cat ai-engineering/agent-budgets-and-runaway-prevention.md

Agent Budgets and Runaway Prevention

How agent harnesses enforce step, time, token, cost, and tool-call limits.

Jatin Bansal@blog:~/ai-engineering$ open agent-budgets-and-runaway-prevention

In November 2025 a team running four LangChain agents watched an Analyzer and a Verifier fall into mutual recursion. The Analyzer produced an analysis; the Verifier asked for further analysis; the Analyzer obliged; eleven days later the bill was $47,000. The team had observability but no enforcement: the dashboard showed the spend climbing, and nothing stopped it. Alerts are asynchronous: by the time the page fires, every API call between the alert and the human reading it has already happened, and on a long weekend that gap is the entire damage window. A second account: a 35-engineer SaaS shop with an $87,000 bill for April 2026, including one developer’s autonomous refactoring weekend that burned $4,200. None of these incidents involved a malicious agent. They were healthy agents running healthy loops, until they weren’t, and the only thing standing between healthy and catastrophic was a budget the harness either had or didn’t.

Budgets must stop the next action

An agent budget is the harness-enforced set of preconditions checked before every step; when any one fails, the run terminates with persisted partial state. The check runs before the next side effect, not during it and not after. Three properties distinguish a budget from observability or rate-limiting. First, it is enforced inside the request path, not by a sidecar or a cron job: the check happens synchronously between the loop’s iteration and the next provider call. Second, it is the disjunction of multiple independent predicates, not a single number: step caps, wall-clock deadlines, token ceilings, dollar caps, tool quotas, oscillation detectors, and external abort signals all fire on the same conditional. Third, it terminates with persisted partial state, not silently. The aborted-with-partial-results outcome is a first-class success state, as the long-horizon-reliability article made explicit, not a failure to be retried.

Alerts are not enforcement. The $47K incident grew during the gap between the alert and the session shutdown. Observability platforms such as Langfuse, LangSmith, and Phoenix report token counts and costs, but do not stop the next call by default. Tracing records the failure; a budget gate prevents it.

Enforcement primitives

Every defensible agent budget is the disjunction of at least the following predicates. A harness that ships only a step cap has shipped 14% of the surface; the rest of this section walks the remaining 86%.

1. Step cap

The simplest predicate: iteration count exceeds N. Every framework ships this with a default: Vercel AI SDK’s stepCountIs(20) is the documented default; the OpenAI Agents SDK’s maxTurns defaults to 10 and raises MaxTurnsExceededError when exceeded; LangGraph’s recursion_limit defaults to 25 and raises GraphRecursionError on breach. Each is a wrapper over the same if step >= cap: abort check.

The trap is treating the default as if it were calibrated for you. The Vercel default of 20, OpenAI’s 10, and LangGraph’s 25 are calibration choices the framework author made for their median user, not for your workload. A research agent that needs 80 turns will hit MaxTurnsExceededError long before completing a real task; a customer-service agent with a step cap of 80 will burn a quarter of an hour on the wrong path before the cap fires. Tune the step cap to your workload’s 95th-percentile completion turn count plus a safety margin: run a sample of real tasks, observe the distribution, and set the cap at p95 + 20%. The cap is the fork-bomb backstop, not the typical termination point. The typical termination is stop_reason != "tool_use" (the model itself stopping). If the cap is firing often, either the cap is wrong or the model is stuck and you need finer-grained detection (no-progress detection, section 6 below).
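The p95 + 20% rule can be sketched in a few lines. This is a minimal illustration, not a calibrated tool: the function name `step_cap_from_samples`, the nearest-rank percentile method, and the sample turn counts are all assumptions made for the example.

```python
import math

def step_cap_from_samples(turn_counts, percentile_pct=95, margin_pct=20):
    """Return a step cap: nearest-rank percentile of observed turns plus a margin."""
    if not turn_counts:
        raise ValueError("need at least one observed run")
    ordered = sorted(turn_counts)
    # Nearest-rank percentile: the smallest observation with at least
    # percentile_pct percent of the samples at or below it.
    rank = max(1, math.ceil(percentile_pct * len(ordered) / 100))
    p = ordered[rank - 1]
    # Round up so the margin never rounds away to nothing.
    return math.ceil(p * (100 + margin_pct) / 100)

# Hypothetical turn counts from 20 completed runs of one workload.
observed = [4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 12, 12, 14, 15, 18, 31]
cap = step_cap_from_samples(observed)  # p95 is 18 turns, so the cap is 22
```

The one 31-turn outlier sits above the cap on purpose: the cap is sized for the bulk of real completions, and a run that needs far more turns than p95 is a candidate for the stuck-model detectors rather than a reason to raise the limit.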

2. Wall-clock deadline

The next predicate: time.monotonic() - started > deadline_s. Step cap is the count limit; deadline is the time limit, and they fire under different failure modes. A model that makes one slow tool call per step can reach step 5 and have burned 60 seconds; a model that makes a hundred fast calls can stay under a 60-second wall-clock but exceed a 50-step cap. Both deserve to terminate.

The discipline that matters: the deadline is per-run, not per-call. A 60-second per-call deadline on a 50-step run gives the agent 50 × 60 = 3000 seconds (fifty minutes) before either limit fires. The right shape is total_deadline = 60s; per_call_deadline = min(remaining_budget, max_per_call). The remaining budget shrinks monotonically, and the per-call deadline tightens as the run ages, so the last step gets less wall-clock than the first. This is the same shape as gRPC’s deadline propagation: the upstream’s deadline is the ceiling on every downstream call, with attenuation for the time already consumed.
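A minimal sketch of that per-run deadline, assuming a hypothetical `RunDeadline` helper (the class name, its methods, and the injectable `clock` parameter are illustrative, not part of any framework):

```python
import time

class RunDeadline:
    """Tracks one run-wide wall-clock budget and derives per-call timeouts from it."""

    def __init__(self, total_s, max_per_call_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + total_s
        self._max_per_call_s = max_per_call_s

    def remaining(self):
        # Never negative, so callers can pass the value straight to a timeout.
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self.remaining() <= 0.0

    def per_call_timeout(self):
        # min(remaining_budget, max_per_call): tightens as the run ages.
        return min(self.remaining(), self._max_per_call_s)

# A fake clock makes the attenuation visible without sleeping.
fake_now = [0.0]
deadline = RunDeadline(total_s=60, max_per_call_s=20, clock=lambda: fake_now[0])
first = deadline.per_call_timeout()  # 20: the per-call cap is the tighter bound
fake_now[0] = 50.0
late = deadline.per_call_timeout()   # 10: only ten seconds of run budget remain
```

Using a monotonic clock matters here: wall-clock time can jump when the system clock is adjusted, which would silently lengthen or shorten the budget.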

The wall-clock deadline interacts with the streaming surface from the streaming article. Cancellation must (a) close the upstream connection so the provider stops decoding (and billing), (b) abort in-flight tool executions whose results will never be used, (c) flush partial state to the conversation log. A deadline that fires but doesn’t propagate cancellation is the worst of both worlds. The run “stops” by Python’s metric, but the provider is still decoding the last response and the bill keeps climbing.
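One way to sketch propagated cancellation uses asyncio from the standard library. The `fake_step` coroutine is a hypothetical stand-in for a streaming provider call plus tool execution, and the log event names are invented for the example. In a real harness the `CancelledError` handler is where the upstream connection would be closed and in-flight tools aborted.

```python
import asyncio

async def fake_step(partial_log):
    """Hypothetical step: stands in for a streaming provider call and its tools."""
    try:
        partial_log.append({"event": "step_started"})
        await asyncio.sleep(10)  # pretend the provider is still decoding
        partial_log.append({"event": "step_finished"})
        return "done"
    except asyncio.CancelledError:
        # (a) close the upstream connection and (b) abort in-flight tools here.
        partial_log.append({"event": "upstream_closed"})
        raise  # re-raise so the caller sees the cancellation complete

async def run_step_with_deadline(step_fn, timeout_s, partial_log):
    """Run one step; on deadline, cancel it and flush the abort to the log."""
    task = asyncio.create_task(step_fn(partial_log))
    try:
        return await asyncio.wait_for(task, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the task on timeout and waits for it to unwind,
        # so the cleanup above has already run by the time we get here.
        # (c) flush partial state to the conversation log.
        partial_log.append({"event": "aborted", "reason": "deadline"})
        return None

log = []
result = asyncio.run(run_step_with_deadline(fake_step, 0.05, log))
events = [entry["event"] for entry in log]
```

The ordering in `events` is the point: the upstream is closed before the abort is recorded, so the run does not report itself stopped while the provider is still decoding.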

3. Token ceiling

The third predicate: cumulative input + output tokens exceed max_tokens_total. Tokens are the proxy for both cost and context blow-up. The trap from the LeanOps incident analysis: a 4,000-token initial context doubling at each step reaches 128K at step 5 and overflows the model’s window by step 15. A token ceiling that fires at 200K total caps the worst case at the cost of two-and-a-half full-context API calls: meaningful in dollar terms, decisive in fork-bomb terms.

Token accounting must break down by cache state, per the prompt-caching article and the agent harness anatomy article. A harness that aggregates “tokens used” without separating cache_read_input_tokens from cache_creation_input_tokens will report a number that bears no relationship to the bill. The right shape:

```text
tokens_used = input_tokens + output_tokens + cache_read + cache_write
dollars_used = (input_tokens × price_in + output_tokens × price_out
                + cache_read × price_cache_read + cache_write × price_cache_write) / 1e6
```

The dollar conversion needs per-provider, per-model, per-cache-tier pricing, and pricing changes quarterly. Pin the pricing-table version into the budget gate’s audit log (the same cost.pricing_version attribute the observability article recommended for spans) so historical comparisons stay coherent across rate changes.
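As a rough sketch of that accounting, the snippet below keeps the four token classes separate and stamps the audit record with the pricing version. The `PRICING` table is hypothetical (dollars per million tokens, with a made-up version string), and the `Usage` class is an illustration rather than any provider’s response type.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens; real tables vary by
# provider, model, and cache tier, and must be versioned.
PRICING = {
    "version": "example-2026-01",
    "input": 3.00,
    "output": 15.00,
    "cache_read": 0.30,
    "cache_write": 3.75,
}

@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read: int = 0
    cache_write: int = 0

    def tokens_used(self):
        return (self.input_tokens + self.output_tokens
                + self.cache_read + self.cache_write)

    def dollars_used(self, pricing=PRICING):
        return (self.input_tokens * pricing["input"]
                + self.output_tokens * pricing["output"]
                + self.cache_read * pricing["cache_read"]
                + self.cache_write * pricing["cache_write"]) / 1e6

    def audit_record(self, pricing=PRICING):
        # Pin the pricing version so historical costs stay comparable.
        return {"tokens": self.tokens_used(),
                "dollars": self.dollars_used(pricing),
                "cost.pricing_version": pricing["version"]}

# Same token count, very different bills, depending on cache state.
cold = Usage(input_tokens=100_000)
warm = Usage(cache_read=100_000)
```

Under these example prices `cold` costs about $0.30 and `warm` about $0.03 for an identical `tokens_used()`, which is why a single aggregated token counter cannot stand in for the dollar figure.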

4. Dollar ceiling

The fourth predicate: cumulative dollar cost exceeds max_dollars. Dollars and tokens diverge because the dollar/token ratio depends on cache hit rate, model tier, and the input/output split; see the cost-optimization and model-routing article for the full math. A token ceiling alone doesn’t catch the cost runaway that happens when a model-routing decision escalates the run from Haiku to Opus mid-loop; a dollar ceiling alone doesn’t catch the context-bloat runaway from a token explosion at constant cost-per-token. Run both.

The defensible production cut: per-run dollar ceiling + per-tenant daily dollar ceiling + per-tenant monthly dollar ceiling. The per-run cap stops the $47K incident at $50 instead of $47,000. The per-tenant daily cap stops the legitimate-but-runaway weekend pattern (one developer, $4,200, three days) at $500 per day instead of $4,200 over the weekend. The per-tenant monthly cap is the financial backstop: the line beyond which the platform team gets paged regardless of cause. These compose; the run aborts when any of them fires. The accounting state lives in Redis or a similar low-latency store keyed by tenant and time window; the budget gate reads the counter before every call.
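The layered ceilings can be sketched with an in-memory dictionary standing in for Redis. Everything named here is illustrative: the `SpendLedger` class, the key format, and the $5,000 monthly limit are assumptions (only the $50 per-run and $500 daily figures come from the text above).

```python
from datetime import datetime, timezone

class SpendLedger:
    """In-memory stand-in for a low-latency counter store such as Redis."""

    def __init__(self):
        self._totals = {}

    def add(self, key, dollars):
        self._totals[key] = self._totals.get(key, 0.0) + dollars

    def get(self, key):
        return self._totals.get(key, 0.0)

# Monthly figure is hypothetical; run and daily figures follow the text.
LIMITS = {"run": 50.0, "tenant_day": 500.0, "tenant_month": 5000.0}

def window_keys(tenant, run_id, now):
    """One counter key per ceiling, scoped by run, tenant-day, and tenant-month."""
    return {
        "run": f"run:{run_id}",
        "tenant_day": f"tenant:{tenant}:{now:%Y-%m-%d}",
        "tenant_month": f"tenant:{tenant}:{now:%Y-%m}",
    }

def record_spend(ledger, keys, dollars):
    for key in keys.values():
        ledger.add(key, dollars)

def first_breached(ledger, keys, limits=LIMITS):
    """Return the name of the first ceiling at or over its limit, else None."""
    for name, key in keys.items():
        if ledger.get(key) >= limits[name]:
            return name
    return None

now = datetime(2026, 4, 4, tzinfo=timezone.utc)
ledger = SpendLedger()
for i in range(11):  # eleven runs at $45 each: $495 for the day
    record_spend(ledger, window_keys("acme", f"r{i}", now), 45.0)

keys = window_keys("acme", "r11", now)
before = first_breached(ledger, keys)  # None: every ceiling still has room
record_spend(ledger, keys, 10.0)
after = first_breached(ledger, keys)   # "tenant_day": $505 for the day
```

Run `r11` has spent only $10 of its own $50, yet the gate refuses its next call. That is the composition at work: no single run looked runaway, but the tenant’s day did.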

5. Per-tool quota

The fifth predicate: calls to tool T this run exceed max_calls_per_tool[T]. A model that calls search_web 200 times in 10 steps is doing something wrong: either it’s stuck (no-progress detection should fire) or it’s interpreting the task wrong (the cap should fire as the backstop). Per-tool quotas are the agent equivalent of RLIMIT_NOFILE: a single tool monopolizing the runtime is itself a failure mode, and the per-tool budget is independent of the global step cap.

The right granularity is by tool class, not always by tool name. Mutating tools (charge_card, send_email, delete_record) get tight per-run caps measured in single digits. Read-only tools (search_web, read_file, query_db) get higher caps measured in dozens. The tool-selection-at-scale article covered the namespace discipline; per-tool quotas inherit from the same namespace tree: tools/mutating/* shares one bucket, tools/read/* shares another. Tool-class quotas are easier to keep consistent across deploys than per-tool caps that drift as the tool catalog grows.
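A tool-class quota might look like the sketch below. The `ToolQuota` class and the specific caps (3 mutating calls, 40 reads) are hypothetical; the namespace prefixes follow the tools/mutating/* and tools/read/* convention described above.

```python
from collections import Counter

# Hypothetical per-run caps, keyed by namespace prefix.
CLASS_CAPS = {"tools/mutating/": 3, "tools/read/": 40}

class ToolQuota:
    """Per-run call budget shared by every tool under a namespace prefix."""

    def __init__(self, class_caps=CLASS_CAPS):
        self._caps = class_caps
        self._used = Counter()

    def _bucket(self, tool_path):
        for prefix in self._caps:
            if tool_path.startswith(prefix):
                return prefix
        # Fail closed: a tool with no class gets no budget at all.
        raise KeyError(f"no quota class for {tool_path}")

    def try_acquire(self, tool_path):
        """Count the call and return True, or return False if the bucket is full."""
        bucket = self._bucket(tool_path)
        if self._used[bucket] >= self._caps[bucket]:
            return False
        self._used[bucket] += 1
        return True

quota = ToolQuota()
mutating = [quota.try_acquire(path) for path in (
    "tools/mutating/send_email",
    "tools/mutating/charge_card",
    "tools/mutating/delete_record",
    "tools/mutating/send_email",   # fourth mutating call: over the shared cap
)]
read_ok = quota.try_acquire("tools/read/search_web")
```

Because the bucket is shared, the fourth mutating call is refused even though `send_email` itself was called only once before; adding a new tool under `tools/mutating/` inherits the cap without a config change.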

6. No-progress detection

The sixth predicate: the last K tool calls were identical. The simplest no-progress detector hashes (tool_name, sorted_args) and looks for repeats in a sliding window; three identical calls in a row is decisive evidence the model is stuck, and no number of additional steps will help. The agent loop article showed the dumb-but-effective Python:

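```python
import json
last_calls = []  # call signatures for this run, oldest first
# In the loop body, once the model has chosen tool_name and args:
last_calls.append((tool_name, json.dumps(args, sort_keys=True)))
if last_calls[-3:].count(last_calls[-1]) >= 3:  # last three signatures identical
    raise RuntimeError("no-progress: 3x repeat")
```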

That’s the floor. The ceiling is oscillation detection: the model alternating between two states without making progress, which the dumb detector misses because no single call repeats three times in a row. The Analyzer/Verifier pattern from the $47K incident is exactly this: A → V → A → V → A → V, where no individual call repeats but the pair does, indefinitely. The fix is to pair each call signature with the previous call’s signature and look for repeated pairs:

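```python
def is_oscillating(history, window=6):
    """True when the last `window` call signatures are one pair repeated (A, V, A, V, ...)."""
    if len(history) < window:
        return False
    recent = history[-window:]
    # Split the window into consecutive (first, second) pairs; use an even window.
    pairs = list(zip(recent[0::2], recent[1::2]))
    return len(set(pairs)) == 1
```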

Three repeated pairs in six steps is oscillation; six strictly alternating calls is the Analyzer/Verifier shape exactly. Production no-progress detectors typically layer these predicates: single-call repetition for the dumb-stuck case, pair-repetition for the alternation case, and a more sophisticated entropy-based detector for the long-horizon meltdown case that the long-horizon reliability article covered.

7. External abort signal

The seventh predicate: an external signal sets aborted = true. The other six predicates are reactive; they fire on state the loop owns. The external signal is for state the loop doesn’t own: an operator dashboard hitting a kill switch, an SLO alert firing on a different service, a budget breach on a different run by the same tenant. The Temporal example from the long-horizon reliability article is exactly this pattern: setHandler(abortSignal, () => { aborted = true; }) plus a while (!aborted) check in the loop body. The external signal is the operational pressure relief valve: when the budget gates are calibrated wrong and the run is climbing toward a dollar ceiling that’s higher than it should be, the operator’s kill switch is the last line of defense.

The signal must be checked at every iteration, not just at “natural” boundaries. A loop body that fires off five tool calls between abort checks is five tool calls past the operator’s intent, the same shape as a thread that doesn’t check a cancellation token between synchronous operations. The discipline is to interleave the signal check with every step.
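A minimal sketch of that interleaving, using `threading.Event` as the abort flag. The `run_tool_calls` helper and the toy tool callables are hypothetical; in a real harness the event would be set by a signal handler or an operator endpoint, not by one of the tools.

```python
import threading

def run_tool_calls(tool_calls, abort_event):
    """Execute calls one at a time, re-checking the abort signal before each."""
    results = []
    for call in tool_calls:
        if abort_event.is_set():
            # Stop before the next side effect and keep the partial results.
            return results, "aborted"
        results.append(call())
    return results, "completed"

abort = threading.Event()

def second_call():
    abort.set()  # simulate the operator's kill switch landing mid-batch
    return "b"

results, status = run_tool_calls([lambda: "a", second_call, lambda: "c"], abort)
```

The third call never runs: the signal arrived during the second call and the check before the third caught it. Checking only once per batch would have let all three through.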

Check cheap limits first

The seven predicates are a disjunction; any one firing aborts the run, but the order in which they’re evaluated determines which one gets credited for the abort, and that matters for the audit log. The right order is cheap-and-decisive first:

1. External abort signal (cheap, decisive)
2. Step cap (cheap, decisive)
3. Wall-clock deadline (cheap, decisive)
4. Dollar ceiling (cheap once usage is tracked, decisive)
5. Token ceiling (cheap, decisive; typically redundant with dollar but cheaper to compute)
6. Per-tool quota (cheap, a per-tool dictionary lookup)
7. No-progress detection (cheap, an O(K) scan over the last K call hashes)

Then the timing rule: predicates fire before the next call, never after. The check happens at the top of the loop body, before the API request goes out; checking after the fact is how you accidentally double your spend right at the limit. The cost-accounting and budget-check code paths share state: both read from the same usage object that the previous iteration’s API response updated. The harness anatomy article made the case for shared state across duties; budgets are the clearest example.
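Put together, the ordered gate might look like the following sketch. The `RunState` and `Limits` dataclasses, their default values, and the predicate names are all assumptions made for illustration; the no-progress check here is only the simple identical-repeat detector.

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunState:
    started: float
    step: int = 0
    tokens: int = 0
    dollars: float = 0.0
    aborted: bool = False
    tool_calls: dict = field(default_factory=dict)  # tool name -> call count
    history: list = field(default_factory=list)     # hashable call signatures

@dataclass
class Limits:
    max_steps: int = 20
    deadline_s: float = 60.0
    max_dollars: float = 50.0
    max_tokens: int = 200_000
    max_calls_per_tool: int = 40
    repeat_window: int = 3

def first_tripped(state, limits, now=None):
    """Return the name of the first predicate that fails, or None to proceed."""
    now = time.monotonic() if now is None else now
    recent = state.history[-limits.repeat_window:]
    # Every predicate is cheap, so all are evaluated; the list order only
    # decides which one is credited in the audit log.
    checks = [
        ("external_abort", state.aborted),
        ("step_cap", state.step >= limits.max_steps),
        ("deadline", now - state.started > limits.deadline_s),
        ("dollar_ceiling", state.dollars > limits.max_dollars),
        ("token_ceiling", state.tokens > limits.max_tokens),
        ("tool_quota", any(n > limits.max_calls_per_tool
                           for n in state.tool_calls.values())),
        ("no_progress", len(recent) == limits.repeat_window
                        and len(set(recent)) == 1),
    ]
    for name, tripped in checks:
        if tripped:
            return name
    return None

limits = Limits()
healthy = first_tripped(RunState(started=0.0, step=3), limits, now=1.0)
both = first_tripped(RunState(started=0.0, step=20, dollars=75.0), limits, now=1.0)
killed = first_tripped(RunState(started=0.0, step=20, aborted=True), limits, now=1.0)
```

In a loop this is called at the top of each iteration, before the provider request is built; a non-None result means persist partial state, log the predicate name, and return. When both the step cap and the dollar ceiling have tripped, `both` reports `step_cap`, which is the attribution the ordering above prescribes.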

Choosing limits from observed runs

Step cap and wall-clock deadline cost almost nothing: a counter increment and a timer read, sub-microsecond per turn. Token and dollar accounting cost the small overhead of summing a few integers and a multiplication. Per-tool quotas cost a dictionary lookup and an increment. No-progress detection costs an O(K) scan over the last K tool calls. External signal-checking costs a single atomic read. The total budget-gate overhead is well under a millisecond per iteration, against API call latencies of hundreds of milliseconds to several seconds. The real cost of the budget is the cost of forgetting to ship one of the predicates.

The cost-optimization article’s dollar framing extends here: the return on a budget gate is the cost of one runaway incident divided by the development cost of the gate. A no-progress detector takes an engineer a day; the $47K incident pays for that engineer’s salary for the year. A dollar ceiling enforced server-side takes a week; the $87K incident pays for the week. A full audit log of which gate fired when, with a one-page incident playbook attached, takes a month; the difference between catching the next runaway at $500 and at $50,000 pays for the month several times over. The budget surface is the single highest-leverage piece of harness work in the production stack, measured in dollars per engineer-hour spent.

The rest of the platform tooling (billing alerts at OpenAI, Anthropic’s quota and rate-limit settings, and its quota-management dashboards) provides the financial backstop above the harness gate. These are not substitutes for the harness gate: the platform-level cap fires after the run has been going for hours, while the harness-level gate fires before the next call. Both. Always both.