The Agent Loop: ReAct and Its Descendants
How the agent loop actually works: ReAct's thought/action/observation cycle, plan-and-execute, stopping conditions, and the leader-election parallel.
A code-review bot ships on a Friday. The first PR it sees is a 12-file refactor; the bot reads the diff, calls the test runner, calls the linter, fetches the failing test source, calls the test runner again, opens another file, calls the linter again, opens another file. Forty minutes and 3,800 tool calls later it has not produced a single comment. The on-call kills it and the post-mortem is short: there was no stopping condition. The bot kept finding “one more thing to check” because every observation suggested a new action and nothing in the loop ever said “you have enough.” The model did not malfunction; the loop did. Welcome to agents, where the model is the easy part and the loop around it is the entire engineering surface.
Opening bridge
Yesterday’s piece on constrained decoding closed out the Generation Control subtree — the last layer of “make the model emit the right shape.” With structured output, tool use, streaming, prompt caching, and constrained decoding all in place, the building blocks of a single inference call are complete. Today we step up one level: chain those calls together and let the model decide what to call next. That is the agent loop, and it is the runtime that sits underneath every agent product you’ve used. The retrieval cascade from the RAG subtree and the tool-use protocol are both ingredients; the loop is the kitchen.
What an agent actually is
The clearest working definition is Simon Willison’s, distilled in September 2025 after a thousand confused arguments: “An LLM agent runs tools in a loop to achieve a goal.” Three pieces, all load-bearing.
- LLM. The decision-maker. Picks the next tool call, interprets the result, decides when to stop.
- Tools in a loop. Not one call, not one tool — repeated invocations where each result feeds the next decision. This is the part that separates an agent from a glorified function call.
- To achieve a goal. There’s a terminal condition. Without it you have an infinite loop, not an agent.
Lilian Weng’s earlier framing — “Agent = LLM + memory + planning skills + tool use” (June 2023) — adds the two ingredients that the loop sits on top of: memory (the conversation history is the cheapest, and the only one most agents need) and planning (which may be implicit in the loop or explicit as a separate step, see below).
Frame it operationally and the loop is a control structure with five moving parts: the driver (your code) calls the model with the context (prompt + history + tool schemas); the model emits tool calls that the driver executes against a tool surface; results splice back into context; loop. Every word in that sentence has its own engineering decisions. Most production failures are in the driver, not the model.
ReAct: the canonical loop
The pattern that nailed the loop down for LLMs is ReAct (Yao et al., 2022) — Reasoning + Acting. The model alternates three step types:
- Thought. Free-text reasoning about what to do next. The model talks to itself.
- Action. A tool invocation with typed arguments — the structured request covered in the tool-use article.
- Observation. The tool result, fed back into the context for the next thought.
ReAct’s original framing was prompt-only — you few-shot the model with examples of Thought:/Action:/Observation: triples and it learns the rhythm. Modern LLMs internalize this without examples; the prompt-only framing has been absorbed into the post-training data of every frontier model. What ReAct contributed wasn’t a prompt template — it was the insight that explicit reasoning steps between actions sharply improve tool-using accuracy. On HotpotQA the original paper saw the model overcome hallucination by reasoning aloud about what it had observed and what it still needed; on ALFWorld and WebShop the gains over imitation learning were 34 and 10 absolute percentage points respectively.
The mechanical artifact of ReAct in 2026 is the structured assistant turn: a text block containing the thought, followed by one or more tool_use blocks containing the action. Anthropic’s tool-use API bakes this in — the response with stop_reason: "tool_use" is exactly one ReAct iteration. Set extended_thinking and the thought becomes a separate thinking block; the structure is the same. OpenAI’s tool calls have the same shape with different field names.
The distributed-systems parallel
The agent loop maps cleanly onto two distributed-systems primitives, and they explain different failure modes.
The first is leader election. In a Raft cluster, one node is elected leader and serves all writes; followers replicate; on leader failure, a new election picks a successor. The LLM in an agent loop is the leader of a one-node cluster — it issues every decision, every step. The “followers” are the tools: stateful services it consults but does not run. The election analogue shows up in two places. First, when you put a second LLM in the loop (a critic, a planner, a router), you’ve introduced multi-leader semantics and need to decide who breaks ties. Second, the loop’s stopping condition is the equivalent of a leader’s term — without one, the leader never steps down. Most agents are single-leader by construction; multi-agent systems are explicitly multi-leader, with the consensus problem that brings.
The second is the request-reply RPC chain with retries. The driver is an RPC client; every tool call is a typed remote procedure (with the gRPC-with-a-flaky-caller parallel from yesterday); the loop is the chain. The failure modes that beat distributed RPC are the same ones that beat agents: timeouts that aren’t propagated end-to-end, retries that double-fire side effects, cascading failures when one tool degrades and the model keeps hammering it. The mitigations are the same too — wall-clock deadlines on each step, idempotency keys on every mutating tool, circuit breakers on the loop, not just on the tools.
The third parallel, less often noticed but worth naming: continuation-passing style. Each turn of the loop is a CPS frame. The model emits a call and yields; the driver executes the body; the next call to the model is the continuation, with the result spliced into context. The conversation history is a serialized call stack. This framing is what makes conversation compaction hard — you cannot drop a frame from the middle and have the stack still validate; the deep dive on the summarization mechanics is in the context-compression article, and the harness-level orchestration (reactive vs preemptive triggers, cache-aware deletion, circuit breakers) is in the conversation-compaction article.
Mechanics: one full iteration
Walk through a single ReAct iteration end to end. The driver holds three pieces of state: a messages list (the conversation), a tools array (the schemas), and a step counter.
- Call the model. The driver POSTs
messagesandtoolsto the provider. With prompt caching configured on the tool block and system prompt, this call is cheap; without it, you re-prefill the tools on every step. - Receive an assistant turn. The response is a list of content blocks. The driver appends the entire content array back to
messages— the API expects byte-identical blocks on the next call. This is the part most home-grown loops botch: stripping the thought blocks and keeping only the tool calls breaks future-turn coherence on every frontier model. - Inspect
stop_reason. If it isn’t"tool_use"(or the OpenAI equivalent"tool_calls"), the loop is done — return the assistant text. - For each
tool_useblock, dispatch. Look up the tool, validate args (or trust strict-mode constraints), execute. Capture the result, the success/error flag, and the duration. - Append a user turn of
tool_resultblocks. Each must carry the matchingtool_use_id. Errors go in the same shape withis_error: true— the model recovers cheaply from a typed error message; raising a Python exception out of the loop is almost never what you want. - Increment the step counter. Compare against the cap. If exceeded, abort with whatever partial progress you have.
That’s it. Every production agent loop in existence is some elaboration of this six-step body. The elaborations are where the interesting decisions live.
Stopping conditions that actually halt
The single load-bearing decision in an agent loop is when to stop. The model can’t be trusted to know — its incentive at each step is to “be more helpful,” and one more tool call always looks helpful from inside the loop. The driver owns the brakes.
A defensible stopping condition is a disjunction of cheap predicates evaluated after each step:
stop_reason != "tool_use". The natural exit — the model didn’t ask for another tool. This is the only stopping condition the model fires; everything else is the driver.- Step cap. Hard maximum number of iterations, no negotiation. The Vercel AI SDK’s default is
stepCountIs(20); most production agents tune this between 5 and 50 depending on task variance. - Wall-clock deadline. Total time across the loop. Critical for user-facing chat where p99 latency matters; less critical for batch agents.
- Token budget. Sum of input + output tokens across all calls. The cheapest predicate to evaluate, the easiest to forget.
- Dollar budget. Same shape as tokens, denominated in money. Worth maintaining as a separate counter because the dollar/token ratio differs across cached vs uncached input and across models.
- No-progress detection. Did the last N steps make distinguishable changes to state? Two identical tool calls in a row is a strong signal the loop is stuck; three is decisive. The simplest implementation is hashing the
(tool_name, args)tuple and looking for repeats. - Goal predicate. “Has the model called the
submit_final_answertool?” — a tool whose only purpose is to terminate the loop. Vercel’shasToolCall(name)is exactly this pattern.
The disjunction matters. Step cap alone is a hammer that takes effective work and partial results with it; a goal predicate alone never fires for a confused model that hallucinates the wrong-named tool. Compose them: stop on the first of (stop_reason, step cap, deadline, budget, no-progress, goal). The fork-bomb parallel from operating systems is exact — every kernel needs an OOM killer, every agent needs an enforced budget. A dedicated article later in the curriculum will go deeper on budgets and runaway prevention.
Plan-and-execute: the batching alternative
ReAct’s signature trade-off is one LLM call per tool call. If a task needs 12 tool invocations, that’s 12 model round-trips. Each call re-prefills the conversation (modulo prompt caching) and re-pays the network round-trip. For long, predictable tasks this is wasteful — the model has already figured out steps 2-12 by the time it emits step 1, and re-thinking each step from scratch costs latency and money. The full trade-off — when plan-and-execute beats reacting, the cost-of-replanning math, and the in-between architectures (ReWOO, LLMCompiler, Tree-of-Thoughts) — is the subject of the next article in this subtree.
Plan-and-execute inverts the loop. First call: ask the model to produce a step-by-step plan as a structured object (a JSON list of typed steps). Then execute the steps sequentially — either with smaller, cheaper executor calls or with raw code, depending on how rigidly the plan can be specified. Optionally re-plan after each step or on failure. LangChain’s plan-and-execute writeup is the canonical reference; the original research lineage runs through Plan-and-Solve (Wang et al., ACL 2023).
The advantages and disadvantages are mirror-image to ReAct’s:
| ReAct | Plan-and-execute | |
|---|---|---|
| LLM calls per task | One per step | One plan + one per step (executor can be smaller) |
| Adaptability | High — every step replanned implicitly | Low — plan is fixed until re-planning |
| Quality on well-specified tasks | Variable — model can drift mid-task | Better — plan forces whole-task reasoning |
| Quality on ill-specified tasks | Better — reacts to surprises | Worse — plan may be wrong; replanning is expensive |
| Latency | Higher — serial round-trips | Lower — executor can use smaller model |
| Debuggability | Hard — reasoning is interleaved | Easier — plan is a typed artifact |
| Token cost | High — model + tools re-prefilled | Lower — executor sees less |
The practical answer in production is hybrid. Use ReAct for the high-variance “figure out what to do” phase; switch to plan-and-execute once the model has converged on a sequence; re-enter ReAct on errors. Reflexion (Shinn et al., NeurIPS 2023) layered an explicit self-reflection step on top of either pattern, where after a failed trial the agent writes a “lessons learned” text into memory before the next attempt — verbal reinforcement without weight updates. Planning Agents vs Reactive Agents is the next article in this subtree and opens up that comparison in full, including the Tree-of-Thoughts and search-style variants; today it’s enough to know plan-and-execute and reflection are the two natural variations on the base ReAct loop.
Code: a ReAct loop in Python with the Anthropic SDK
A research-assistant agent with three tools: web search, document fetch, and final answer. Install: pip install anthropic. Uses the Anthropic SDK.
| |
Three things to notice. First, the budgets are checked before the API call — checking after the fact is how you accidentally double your spend right at the limit. Second, the no-progress detector is dumb but effective: a hash of the last three (tool_name, args) tuples. If they’re identical, the model is stuck in a loop and no amount of additional steps will help. Third, submit_final_answer is the only “happy path” exit other than the model declining to call a tool — having an explicit terminator tool is much easier to test and reason about than relying on the model to know when to stop emitting tool calls.
The step cap (max_steps=12) is the floor of the abstraction. Without it, a confused model with access to search_web will happily search forever for any sufficiently vague question. The cap is the fork-bomb backstop; the deadline and the no-progress detector are finer-grained nets above it.
Code: a TypeScript loop with the Vercel AI SDK
The Vercel AI SDK wraps the loop inside generateText. The interesting decisions move into the stopWhen clause. Install: npm install ai @ai-sdk/anthropic zod.
| |
The stopWhen array composes — any predicate returning true terminates. stepCountIs(12) is the hard cap; hasToolCall("submitFinalAnswer") is the goal predicate; the inline closure handles wall-clock, token budget, and no-progress in one place. Compare to the Python version: the structure is identical, the wrapper just hides the loop body. The Vercel SDK’s tool() helper handles the tool_result plumbing for you; the loop itself is genuinely one line of configuration.
One operational note: the SDK auto-executes every tool by default, including submitFinalAnswer. If you’d rather treat the final-answer tool as a sentinel without running its body, return early in execute (as above) or use the SDK’s lower-level streaming primitives to intercept the tool call before execution.
What the harness owns
A useful frame: the agent loop has two layers, and the harness — the runtime that drives the model — owns the bottom layer entirely.
- Top layer (model decisions): which tool, with what arguments, when to stop emitting tool calls. The model’s job.
- Bottom layer (harness duties): budget enforcement, retry policy, observability, error transformation, prompt-cache hygiene, message-list construction, streaming, cancellation. Always the driver’s job.
Confusing these is the most common architecture mistake. Asking the model to “please respect the budget” in the system prompt does not actually enforce a budget; the model has no way to add up tokens across calls. Asking it to “stop if you’ve called the same tool twice” does not actually prevent loops; the model has no memory of prior turns it can compare against deterministically. The harness sees everything the model can’t and enforces what the model shouldn’t be trusted to. The agent harness anatomy article walks through the seven duties — context assembly, tool dispatch, streaming, prompt-cache management, error recovery, cost accounting, telemetry — as one integrated runtime; for today, the slogan is the model is the policy, the harness is the kernel. Reading Anthropic’s “Building effective agents” (December 2024) is the right primer for the build-vs-buy discussion on harness choice; it sets the simplicity-first defaults the rest of this subtree builds on.
Trade-offs, failure modes, gotchas
The loop is the bottleneck on latency. Every step is a network round-trip plus a model call plus a tool call. Even with prompt caching covering the input side, a 10-step task with a 1-second per-step floor is 10 seconds of latency — visible to a human user, costly in a batch pipeline. Mitigations: parallel tool calls when the calls are independent (most providers default to “concurrent unless you opt out”), plan-and-execute for predictable sequences, and a smaller/faster executor model behind a stronger planner. The streaming article covered how to surface partial progress to the user; do that early in the loop, not at the end.
The token bill is the loop’s other bottleneck. Without prompt caching, every step re-prefills the entire conversation. By step 10 of a 12-step agent, the prompt has grown to include nine prior tool calls and nine prior tool results. At 100k tokens that’s a substantial prefill on every turn. The mitigations are caching (covered), conversation compaction, and JIT-only context fetches (covered) so the conversation doesn’t carry retrieved material the model no longer needs.
Tool selection collapses past ~30 tools. Once the tool count grows, the model’s pick-the-right-tool accuracy drops faster than you’d expect — descriptions interfere, the schema serialization eats your token budget. The standard fix is two-stage tool retrieval: embed the descriptions, retrieve a top-k subset per turn, only pass those to the model. MCP and dynamic tool routing are the architectural answers; the dedicated article in this subtree walks through deferred loading, retrieval, and namespacing in depth.
Hallucinated tool names and unknown arguments. The model occasionally invents tools that don’t exist, or calls a real tool with an argument outside its schema. Without strict mode this is a runtime error; with strict mode constraints the wrong tool name simply can’t be emitted. Either way, the driver’s job is to translate the failure into a tool_result with is_error: true and a precise error message — never throw out of the loop. The model is very good at recovering from “tool X doesn’t exist, valid tools are Y, Z” on the next turn.
Forgetting to append assistant turns verbatim. The single most common home-grown-loop bug: stripping the text block, or the thinking block, or the unused half of a parallel tool call, when appending the assistant message back to messages. The API expects what it sent; modifying the assistant turn between calls causes coherence loss, sometimes silently. Append the whole resp.content list.
Replay and idempotency. If you replay the loop from a checkpoint, you’ll re-execute every tool that was called between the checkpoint and the failure. Mutating tools (charge_card, send_email, update_ticket) need idempotency keys at the runtime layer, not at the model layer. The model is an at-least-once caller; design every state-changing tool to absorb that. The full saga-pattern treatment of compensations, between-node vs durable checkpointing, and when to abort vs retry is in the long-horizon reliability article.
Streaming vs non-streaming inside the loop. The user-facing chat case wants streaming end-to-end: stream the model’s response, surface partial text immediately, route tool calls as they materialize from the input_json_delta stream. The batch case wants non-streaming for simpler bookkeeping. Picking the right mode at the API boundary is a one-line decision but affects the entire harness design — the streaming path needs partial-JSON parsers and cancellation propagation; the non-streaming path doesn’t.
tool_choice: "any" is a different control structure. Forcing a tool call on every turn collapses the loop into a sequence of mandatory actions and changes the model’s behavior — it can no longer “ask a clarifying question” or “explain its reasoning before acting.” Useful for short, fully-typed pipelines; wrong for open-ended agents.
The model can lie about progress. A confused model will sometimes call submit_final_answer with a plausible-sounding but wrong result. The loop has no way to detect this from inside; you need a separate eval, a verifier (a second model or a deterministic check), or human review for high-stakes outputs. This is where evaluation and the LLM-as-judge pattern (covered later in the curriculum) start to matter, and where simple agent loops graduate into evaluator-optimizer workflows.
Observability is non-negotiable. Every step needs a trace: tool name, arguments, latency, success/error, tokens in/out, cumulative cost. Without traces, debugging a 12-step loop is hopeless — you can’t tell from the final answer which step went wrong. The Production & Operations subtree will cover tracing in depth; for now, log every step’s full payload to disk, behind a flag, and read it when things break.
Further reading
- Simon Willison — “An LLM agent runs tools in a loop to achieve a goal” — the definition-fixing post that finally pinned down what “agent” means in the post-2024 sense. Short, sharp, and the working vocabulary the rest of the field has converged on.
- Anthropic — “Building effective agents” — the December 2024 engineering writeup that distinguishes workflows from agents and walks through the prompt-chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer patterns. Read this before reaching for a framework.
- Lilian Weng — “LLM Powered Autonomous Agents” — the 2023 survey that fixed the planning/memory/tools decomposition and is still the cleanest concept map for the agent surface.
- LangChain — “Plan-and-Execute Agents” — the canonical writeup on the planner/executor split, with the LangGraph implementation reference. Read alongside the Plan-and-Solve paper (Wang et al., ACL 2023) for the research lineage.
What to read next
- Planning Agents vs Reactive Agents — the direct sequel. Picks up where the plan-and-execute section above stopped: when each architecture wins, the cost-of-replanning math, the in-between architectures (ReWOO, LLMCompiler, Tree-of-Thoughts), and runnable planner/executor splits in Python and TypeScript.
- Tool Selection at Scale: MCP and Dynamic Routing — what happens to the loop when the tool count grows past ~30: selection accuracy collapses, token cost blows out, and the loop driver has to start retrieving tools instead of passing all of them on every turn.
- Computer Use and Browser Agents — the action-surface variant of the loop. The screenshot-in / click-out loop and the DOM-accessibility-tree loop are the same iteration structure as ReAct, with the typed-tool action surface replaced by pixels or by accessibility nodes; the failure modes (coordinate drift, screenshot bloat, prompt injection at the OS boundary) are specific to the action surface, not the loop.
- Long-Horizon Task Reliability — what the loop looks like when it runs for hours, not seconds. The budgets, no-progress detection, and idempotency primitives from this article all earn their keep there; checkpointing, the saga pattern, and the abort-vs-retry decision are the new primitives.
- Anatomy of an Agent Harness — the runtime that wraps everything in this article. The seven harness duties (context assembly, tool dispatch, streaming, prompt-cache management, error recovery, cost accounting, telemetry) become one integrated kernel; the build-vs-buy decision turns on whether the framework’s calibration assumptions match your workload.