
Multi-Agent Orchestration

Supervisor, swarm, and hierarchical multi-agent patterns: the A2A protocol, split-brain failure modes, the 15x token tax, and when not to reach for it.

Jatin Bansal@blog:~/ai-engineering$ open multi-agent-orchestration

A research-assistant agent answers “list every public board director appointed in 2024 across the Fortune 100.” A single ReAct loop runs it: search, click, read, search, click, read. Twelve minutes in, the model has 18 names and a context window that’s 70% biographical noise about the first three CEOs. It quietly drops earlier observations and forgets companies it already covered. The next day the same task ships to a multi-agent system: a lead agent decomposes the prompt into 100 per-company subtasks, fans them out to 100 small subagents in parallel, each subagent owns one company’s 8k context window, and a synthesizer merges the results. Three minutes wall-clock, 478 names, no long-horizon drift. It also burned roughly 15× the tokens of a single chat conversation — Anthropic’s own number for their multi-agent research product. Multi-agent orchestration is what you reach for when the bottleneck isn’t reasoning quality — it’s context bandwidth and parallelism. It is also what most teams reach for far too early.

Opening bridge

Yesterday’s piece on planning vs reactive agents walked through the single-agent control-flow decision: when to plan ahead, when to react step-by-step. Both architectures had the same shape — one model deciding, one loop running. Today we change the shape: more than one model in the loop, each with its own context, talking to each other through a protocol the harness defines. The agent-loop article flagged this as the “multi-leader” extension of the leader-election parallel and promised it would get its own piece. This is that piece.

What “multi-agent” actually means

The phrase is overloaded. Three patterns get called “multi-agent” and they are not the same thing.

  • Multi-step single-agent. One model, multiple turns, multiple tools. This is just a regular agent. Calling it multi-agent because it makes many calls is a category error.
  • Workflow with multiple LLM calls. A pipeline where different prompts (or different models) handle different stages of a fixed flow. Anthropic’s “Building effective agents” classifies these as workflows, not agents: the control flow is statically defined by the developer, not by the model. Routing, parallelization, and orchestrator-workers from that taxonomy all fit here.
  • True multi-agent system. Multiple agents, each with its own loop, its own context, and its own goal-directed reasoning, communicating through a defined protocol. The control flow is decided at runtime by the agents themselves — who delegates to whom, when to terminate, when to escalate. This is what the rest of the article means by multi-agent.

The bright line is whether each agent has its own loop and its own context window. If your “agents” are sequential prompts on a single conversation history, you have a chatbot with personas, not a multi-agent system.

The three orchestration patterns

There are roughly three orchestration topologies that production systems actually use. Each one trades a different axis.

Supervisor (orchestrator-worker)

One lead agent (the supervisor) receives the user request, decides which specialist worker to invoke, dispatches a subtask, waits for the result, and either calls another worker or returns the synthesized answer to the user. Workers cannot talk to each other; every routing decision flows through the supervisor.

This is the pattern Anthropic uses for their research product and what the langgraph-supervisor library bakes in. It maps cleanly onto a leader-and-followers topology: the supervisor is the leader, workers are stateless functions from the supervisor’s perspective. Communication is request-reply, never peer-to-peer.

Strengths: every routing decision is visible in traces. Easy to add a new specialist (extend the supervisor’s tool list). Workers can be small models because each subtask is narrow. Failure modes are localized — a bad worker doesn’t poison the supervisor’s context.

Weaknesses: the supervisor is a bottleneck (every message goes through it) and a cost center (it pays for the full conversation history while routing). Anthropic reports the multi-agent research configuration burns about 15× the tokens of a single chat; most of that overhead is the lead-agent context.

Swarm (peer handoffs)

Agents hand off control directly to each other. There is no fixed orchestrator; whichever agent currently holds the conversation can transfer it to a peer by emitting a structured handoff. The next agent picks up the same conversation and continues. The pattern was popularized by OpenAI’s Swarm (now superseded by the production-grade OpenAI Agents SDK) and codified as a first-class primitive in the SDK’s handoffs mechanism.

Swarm is cheaper than supervisor because there’s no per-step routing call — the handoff is a tool call inside the active agent’s loop. It’s also more adaptive: agents that detect mid-task that they’re the wrong specialist can re-route themselves without a round-trip to the supervisor.

Weaknesses: control flow is decentralized, which means traces look like a directed graph of handoffs rather than a single thread. Loop detection is harder (agent A hands off to B which hands back to A which hands off to B again). And there’s no single point that owns the termination condition — every agent has to enforce it or the swarm runs forever.
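
One way to give a swarm a single owner for termination is to let the harness, rather than any individual agent, watch the handoff sequence. The sketch below is a hypothetical helper, not part of any SDK: the HandoffGuard name, the hop budget, and the window size are all assumptions. It stops the run when the hop budget is spent or when the last few handoffs only bounce between two agents.

```python
from collections import deque


class HandoffGuard:
    """Harness-side guard for peer handoffs (hypothetical helper, not an SDK API)."""

    def __init__(self, max_handoffs: int = 8, window: int = 4):
        self.max_handoffs = max_handoffs        # total hop budget for one request
        self.recent = deque(maxlen=window)      # last few (source, target) hops
        self.count = 0

    def allow(self, source: str, target: str) -> bool:
        """Record a handoff; return False when the swarm should be stopped."""
        self.count += 1
        self.recent.append((source, target))
        if self.count > self.max_handoffs:
            return False                        # hop budget exhausted
        hops = list(self.recent)
        # Ping-pong: the window is full and holds only A->B and B->A.
        if (len(hops) == self.recent.maxlen
                and len(set(hops)) == 2
                and hops[0] == hops[1][::-1]):
            return False
        return True
```

The harness would call allow(...) before executing each handoff and fall back to a canned "could not complete" answer when it returns False. The thresholds are a judgment call; a window of 4 tolerates one legitimate round trip between two agents.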

Hierarchical (supervisor of supervisors)

Stack supervisor patterns: a top-level supervisor routes to sub-supervisors, each of which routes to its own pool of workers. Used in production when the task space genuinely decomposes — e.g., a top-level “ops” supervisor with subordinate “billing,” “support,” and “engineering” supervisors, each managing their own specialists.

This is rarely the right starting point. The hierarchy multiplies the routing-overhead tax: every step pays the supervisor cost twice. Reach for it only when (a) the specialist count exceeds ~10 and a single supervisor’s tool list is unwieldy, or (b) different sub-domains need different routing policies. Otherwise a flat supervisor with a router-style tool selection layer is simpler.

The distributed-systems parallels

Three analogies, each illuminating a different facet.

Leader-and-followers vs gossip protocols. The supervisor pattern is a Raft-style single-leader topology: the leader serializes all decisions; followers do not communicate. The swarm pattern is closer to a gossip protocol or a service-mesh sidecar topology: peers exchange messages directly, the topology is the wiring of who-knows-whom, and there’s no single arbiter. The reliability story differs accordingly. Supervisor failures are catastrophic (no leader, no progress) but rare and easy to recover from. Swarm failures are local but harder to diagnose — agent A may “succeed” but pass garbage to agent B, and the garbage propagates.

Split-brain in multi-agent systems is the same disease as in distributed databases. Two agents working on the same shared state (a shared document, a shared scratchpad, a shared task queue) can both believe they are the source of truth and overwrite each other. The fix is the same as in databases: either elect a single writer (the supervisor pattern), use conflict-free replicated data types (CRDTs) on the shared state, or define explicit conflict-resolution rules. The “blackboard architecture” — a database-and-mediator approach that goes back to Hearsay-II in the 1970s and got a 2025 LLM-era revival — is the disciplined fix; the swarm pattern usually punts and hopes the handoff protocol is structured enough that conflicts don’t happen.
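
As an illustration of an explicit conflict-resolution rule, here is a minimal sketch of a shared scratchpad with compare-and-set writes. It is an assumption for illustration, not something from a particular framework: the VersionedScratchpad class is hypothetical, it only works inside one process, and it rejects stale writes rather than merging them.

```python
import threading


class VersionedScratchpad:
    """Shared state with compare-and-set writes (illustrative, in-process only)."""

    def __init__(self, content: str = ""):
        self._lock = threading.Lock()
        self._content = content
        self._version = 0

    def read(self) -> tuple[str, int]:
        """Return the content together with the version it was read at."""
        with self._lock:
            return self._content, self._version

    def write(self, new_content: str, expected_version: int) -> bool:
        """Apply the write only if nobody else wrote since `expected_version`."""
        with self._lock:
            if expected_version != self._version:
                return False      # stale read: caller must re-read and retry
            self._content = new_content
            self._version += 1
            return True
```

Two agents that read the same version cannot both win: the second write returns False, and that agent has to re-read before trying again. That turns a silent overwrite into a visible, retryable event.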

The dispatcher/worker queue. A supervisor-with-workers system is structurally identical to a job-queue architecture: the supervisor is the dispatcher, each worker is a queue consumer, the handoff is the enqueue, the result is the response. Many of the lessons from that world transfer directly: worker pools should be sized for the actual concurrency, slow workers should not block the dispatcher, retries on worker failure need to be bounded, and the dispatcher’s queue itself needs backpressure. Anthropic’s published failure mode — “spawning 50+ subagents for a simple query” — is a runaway fork-bomb, the same bug a misconfigured job queue exhibits when its rate limiter is missing.
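
The job-queue lessons translate into very little code. The following is a sketch under stated assumptions: run_subtask is a stand-in for whatever actually runs a subagent, and the concurrency, retry, and timeout defaults are arbitrary. It shows bounded concurrency, a per-task timeout so slow workers cannot stall the batch, and bounded retries.

```python
import asyncio


async def run_subtask(task: str) -> str:
    """Stand-in for a real subagent call (hypothetical; swap in your own runner)."""
    await asyncio.sleep(0)
    return f"result for {task}"


async def dispatch(tasks, worker=run_subtask, max_concurrency=5,
                   max_retries=2, timeout_s=30.0):
    """Run subtasks with bounded concurrency, a per-task timeout, and bounded retries."""
    gate = asyncio.Semaphore(max_concurrency)   # backpressure: at most N workers in flight

    async def run_one(task):
        async with gate:
            for attempt in range(max_retries + 1):
                try:
                    # A slow worker is cut off instead of stalling the whole batch.
                    return await asyncio.wait_for(worker(task), timeout_s)
                except Exception as exc:        # deliberately broad for a sketch
                    last_error = exc
            return f"FAILED {task}: {last_error!r}"

    # gather preserves input order, so results line up with tasks.
    return await asyncio.gather(*(run_one(t) for t in tasks))
```

A failed subtask comes back as a marked string instead of raising, so the synthesizer can report partial coverage rather than losing the whole batch. Whether that is the right policy depends on the task; for some workloads failing fast is safer.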

The communication protocols

Two protocols matter in 2026, and they sit at different layers.

MCP (Model Context Protocol) is the agent-to-tool protocol. An agent connects to an MCP server, discovers the tools the server exposes, and invokes them. Almost all multi-agent systems use MCP under the hood — the worker tools, the search tools, the file-system tools — but MCP itself is single-agent: there’s no notion of one MCP-using agent being a tool for another MCP-using agent. The tool-selection-at-scale article covers MCP as a discovery layer and the dynamic-routing patterns that go with it; the framing here is just: agent-to-tool, not agent-to-agent.

A2A (Agent2Agent Protocol) is the agent-to-agent protocol that matters now. Google introduced it in April 2025 and later donated it to the Linux Foundation; it reached v1.0 in early 2026. A2A defines three primitives:

  • Agent Cards. A machine-readable document each agent publishes describing its identity, capabilities, skills, endpoint, and authentication. This is how one agent discovers what another can do — analogous to a protocol buffers service descriptor.
  • Tasks. The unit of work exchanged between agents. Tasks have a lifecycle (submitted → working → completed / failed / canceled / rejected), a unique ID, and structured input/output. Tasks are how a supervisor “delegates” to a worker over the wire.
  • Transport. JSON-RPC 2.0 over HTTP, Server-Sent Events for streaming, optional gRPC bindings. The transport choice is per-binding; the semantics are identical.
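
The Task lifecycle is easiest to reason about as a small state machine. The sketch below models only the states named above. The transition table is an assumption made for illustration, not a copy of the specification, and the real protocol may define additional states and rules. The TaskState, ALLOWED, advance, and is_terminal names are hypothetical.

```python
from enum import Enum


class TaskState(str, Enum):
    SUBMITTED = "submitted"
    WORKING = "working"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"
    REJECTED = "rejected"


# Assumed transition table for illustration; check the A2A spec for the real rules.
ALLOWED = {
    TaskState.SUBMITTED: {TaskState.WORKING, TaskState.REJECTED, TaskState.CANCELED},
    TaskState.WORKING: {TaskState.COMPLETED, TaskState.FAILED, TaskState.CANCELED},
}


def is_terminal(state: TaskState) -> bool:
    """A state with no outgoing transitions ends the task."""
    return state not in ALLOWED


def advance(current: TaskState, new: TaskState) -> TaskState:
    """Move a task to `new`, refusing transitions the table does not allow."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

A supervisor that tracks delegated work this way can tell "still running" apart from "finished badly", which matters once retries and cancellation enter the picture.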

A2A and MCP are complementary, not competitive: an agent uses MCP to call tools, A2A to call other agents. As of mid-2026, A2A has 150+ supporting organizations including Google, Microsoft, AWS, Salesforce, and IBM, and is the closest thing to a cross-vendor standard for multi-agent communication.

For systems entirely inside one process or one company, you don’t need either. A function call, a typed message bus, or a shared LangGraph state object works fine. A2A becomes necessary when agents cross trust boundaries — your agent calling a third party’s agent across the internet — exactly the boundary where you also need authentication, capability negotiation, and versioning.

When multi-agent pays off

The honest answer is: rarely, but when it does, the win is large. Anthropic’s research-system writeup reports that their multi-agent configuration (a Claude Opus 4 lead with Claude Sonnet 4 subagents) outperformed a single Claude Opus 4 agent by 90.2% on internal evaluations, while burning roughly 15× the tokens of a chat conversation. Token usage alone explained about 80% of the performance variance on the BrowseComp browsing evaluation. The pattern is unambiguous: multi-agent buys you parallelism and context isolation, and you pay for it linearly in tokens.

Four signs multi-agent is the right tool:

  • The task is breadth-first. Independent sub-investigations that can be run in parallel and merged at the end. Research, due-diligence, fan-out scraping, cross-document comparison. Each subagent owns its own context window; the total context budget across the system scales linearly with subagent count.
  • The context window is the bottleneck. When a single conversation can’t fit all the information needed for the task, multi-agent gives each subagent a fresh 200k-token window. The lead agent only needs the summarized findings.
  • Specialization actually matters. Different sub-domains need different system prompts, different tool sets, different temperature settings, or different model sizes. A coding agent and a research agent and a database agent really are different programs; trying to be all three from one system prompt under-fits all three.
  • The token budget is not the constraint. If you’re operating at scale where 15× cost is acceptable for a 2× quality lift, multi-agent makes sense. If you’re rate-limited on dollars per request, it doesn’t.

Four signs it’s the wrong tool:

  • The task is sequential. Step N depends on step N-1. No parallelism to exploit. Multi-agent gives you all of the routing overhead and none of the bandwidth benefit. Use plan-and-execute instead.
  • You haven’t squeezed the single agent. Most “we need multi-agent” requirements turn out to be “we need better tool descriptions, better stopping conditions, and better prompt caching.” Burn down the single agent’s failure modes first.
  • Trust boundaries don’t exist. If every “agent” is in your codebase, sharing the same memory and the same database, the agents are just functions with extra steps. Three function calls are cheaper than three agents.
  • Observability isn’t ready. Multi-agent traces look like graphs, not threads. If your tracing tool can’t render an agent DAG with timing and token costs per node, debugging will eat you alive. Production tracing for LLM systems gets its own piece in the Evaluation subtree.

Code: a supervisor pattern in Python with the OpenAI Agents SDK

A research supervisor with two worker agents: a web-search specialist and a database specialist. The supervisor receives a question, hands off to whichever specialist is relevant, and returns the synthesized answer. Install: pip install openai-agents. Uses the OpenAI Agents SDK; handoffs is the SDK’s first-class swarm primitive, but a single hub agent that holds handoffs to all peers gives you supervisor semantics.

python
import asyncio
from agents import Agent, Runner, function_tool

# --- specialists: each is a regular agent with its own tools and instructions ---

@function_tool
def web_search(query: str) -> str:
    """Run a web search and return the top results as text."""
    # In real code: requests.get("https://api.tavily.com/search", ...)
    return f"[stub] top 3 web results for: {query}"

@function_tool
def sql_query(query: str) -> str:
    """Run a read-only SQL query against the internal warehouse."""
    # In real code: psycopg.connect(...).execute(query)
    return f"[stub] rows for: {query}"

web_agent = Agent(
    name="web_researcher",
    instructions=(
        "You are a web research specialist. Use the web_search tool to "
        "answer questions about public information, news, and competitor data. "
        "Return a concise factual summary with citations. Do not speculate."
    ),
    tools=[web_search],
    model="gpt-5-nano",   # small model; per-step reasoning is narrow
)

db_agent = Agent(
    name="db_researcher",
    instructions=(
        "You are an internal data specialist. Use the sql_query tool against "
        "the warehouse to answer questions about our own customers, orders, "
        "and metrics. Schema doc lives in the system prompt; never invent tables."
    ),
    tools=[sql_query],
    model="gpt-5-nano",
)

# --- supervisor: the routing brain. Holds handoffs to both specialists. ---

supervisor = Agent(
    name="supervisor",
    instructions=(
        "You are the routing supervisor for a research assistant. For each user "
        "question, decide which specialist to invoke. Web questions go to "
        "web_researcher; questions about internal data go to db_researcher. "
        "Some questions need both — invoke them in sequence, then synthesize. "
        "Never answer factual questions directly; always route. Stop when you "
        "have an answer the user can act on."
    ),
    handoffs=[web_agent, db_agent],
    model="gpt-5",   # the supervisor is the leader; pay for the big model here
)

async def main():
    result = await Runner.run(
        supervisor,
        "How does our Q1 churn rate compare to our top three competitors' public numbers?",
    )
    print(result.final_output)

asyncio.run(main())

Three things to notice. First, handoffs are tool calls, not chat messages. The SDK exposes each handoff target to the supervisor as a synthetic tool; the supervisor “transfers” by emitting a tool call, and the SDK swaps the active agent. This is why handoffs trace cleanly. It also means control does not return on its own: once a specialist is active, it produces the final output unless it has a handoff back to the supervisor, so the “invoke them in sequence, then synthesize” instruction only holds if the specialists can hand back (or if you expose them through the SDK’s agents-as-tools mechanism, Agent.as_tool(...), instead of handoffs). Second, the model split matters. The supervisor runs on the big model because routing is the load-bearing decision; the workers run on a small model because per-step reasoning is narrow. Anthropic’s published split (an Opus lead with Sonnet subagents) is the same idea. Third, the conversation history is shared by default. The new agent sees what the previous agent saw, which is what makes the handoff feel like a relay rather than a fresh start. For true context isolation (one agent per fresh window), run each subagent with its own Runner.run(...) call and have the supervisor synthesize over the returned outputs; that is the orchestrator-worker pattern proper.

Code: a supervisor in TypeScript with LangGraph

The same pattern in LangGraph, exposing the state machine explicitly. Install: npm install @langchain/langgraph @langchain/anthropic @langchain/core zod. Uses the LangChain Anthropic provider.

typescript
import { ChatAnthropic } from "@langchain/anthropic";
import { ToolMessage } from "@langchain/core/messages";
import { StateGraph, Annotation, END, START, Command } from "@langchain/langgraph";
import { tool } from "@langchain/core/tools";
import { z } from "zod";

// --- worker tools ---
const webSearch = tool(
  async ({ query }) => `[stub] web results for: ${query}`,
  {
    name: "web_search",
    description: "Run a web search and return the top results as text.",
    schema: z.object({ query: z.string() }),
  },
);
const sqlQuery = tool(
  async ({ query }) => `[stub] rows for: ${query}`,
  {
    name: "sql_query",
    description: "Run a read-only SQL query against the internal warehouse.",
    schema: z.object({ query: z.string() }),
  },
);

const supervisorModel = new ChatAnthropic({ model: "claude-opus-4-7", temperature: 0 });
const workerModel = new ChatAnthropic({ model: "claude-haiku-4-5", temperature: 0 });

// --- graph state: the messages list + which agent is active + a step counter ---
const State = Annotation.Root({
  messages: Annotation<any[]>({ default: () => [], reducer: (a, b) => [...a, ...b] }),
  next: Annotation<string>({ default: () => "supervisor", reducer: (_, x) => x }),
  // Additive counter: the supervisor returns { steps: 1 } on every routing turn.
  steps: Annotation<number>({ default: () => 0, reducer: (total, inc) => total + inc }),
});

// --- supervisor node: routes via a typed "next" decision ---
const RoutingDecision = z.object({
  next: z.enum(["web_researcher", "db_researcher", "FINISH"]),
  reason: z.string(),
});

async function supervisorNode(s: typeof State.State) {
  if (s.steps >= 10) return { next: "FINISH" };   // hard cap on routing turns
  const router = supervisorModel.withStructuredOutput(RoutingDecision);
  const out = await router.invoke([
    { role: "system", content: "Route the conversation to the right specialist or FINISH." },
    ...s.messages,
  ]);
  return { next: out.next, steps: 1 };   // count this routing turn
}

// --- worker nodes: call the model, execute any tool calls, report findings ---
async function runWorker(name: string, system: string, workerTool: any, history: any[]) {
  const llm = workerModel.bindTools([workerTool]);
  const convo: any[] = [{ role: "system", content: system }, ...history];
  let reply = await llm.invoke(convo);
  // Bounded tool loop: run the requested calls, feed results back, ask again.
  for (let round = 0; round < 3 && (reply.tool_calls?.length ?? 0) > 0; round++) {
    convo.push(reply);
    for (const call of reply.tool_calls ?? []) {
      const result = await workerTool.invoke(call.args);
      convo.push(new ToolMessage({ content: String(result), tool_call_id: call.id ?? "" }));
    }
    reply = await llm.invoke(convo);
  }
  const text = typeof reply.content === "string" ? reply.content : JSON.stringify(reply.content);
  // Report back as a labeled user-role message so the shared history stays plain text.
  return { messages: [{ role: "user", content: `[${name}] ${text}` }], next: "supervisor" };   // hand control back
}

const webNode = (s: typeof State.State) =>
  runWorker("web_researcher", "You are the web research specialist.", webSearch, s.messages);

const dbNode = (s: typeof State.State) =>
  runWorker("db_researcher", "You are the internal-data specialist.", sqlQuery, s.messages);

// --- routing edge: read the supervisor's structured decision ---
function route(s: typeof State.State) {
  if (s.next === "FINISH") return END;
  return s.next;
}

const graph = new StateGraph(State)
  .addNode("supervisor", supervisorNode)
  .addNode("web_researcher", webNode)
  .addNode("db_researcher", dbNode)
  .addEdge(START, "supervisor")
  .addConditionalEdges("supervisor", route, {
    web_researcher: "web_researcher",
    db_researcher: "db_researcher",
    [END]: END,
  })
  .addEdge("web_researcher", "supervisor")   // workers return to the supervisor
  .addEdge("db_researcher", "supervisor")
  .compile();

export async function ask(question: string) {
  return graph.invoke({ messages: [{ role: "user", content: question }] });
}

The LangGraph version makes the supervisor pattern’s shape explicit: a star topology with the supervisor at the center, every worker edge entering and leaving through it. The hard step cap (s.steps >= 10) is the OOM-killer for routing — the supervisor can spin forever if no worker reports progress, and the cap is the only thing that guarantees termination. Notice also that workers’ next: "supervisor" is what makes this a supervisor pattern rather than a swarm; a peer-handoff variant would set next to another worker name directly, bypassing the supervisor. LangGraph’s Command primitive (imported above but not used here) is the explicit handoff primitive when you want to go that route.

Trade-offs, failure modes, gotchas

The 15× token tax is real and unavoidable. Multi-agent systems pay for the supervisor’s full conversation context plus every worker’s separate context. Anthropic’s published number — agents use about 4× more tokens than chat, multi-agent systems use about 15× more — is roughly the bound for any well-tuned system. There are no architectural tricks that defeat this. Either the value justifies it or it doesn’t.
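
To make the tax concrete, here is the arithmetic as a small sketch. The 4× and 15× multipliers are the ones quoted above; the 2,000-token chat baseline and the price of $5 per million tokens used in the example are made-up inputs, not measurements.

```python
# Multipliers quoted in the text, relative to a plain chat interaction.
AGENT_MULTIPLIER = 4
MULTI_AGENT_MULTIPLIER = 15


def estimated_tokens(chat_tokens: int) -> dict[str, int]:
    """Scale a chat-sized token count to agent and multi-agent configurations."""
    return {
        "chat": chat_tokens,
        "single_agent": chat_tokens * AGENT_MULTIPLIER,
        "multi_agent": chat_tokens * MULTI_AGENT_MULTIPLIER,
    }


def cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a flat blended price (an assumption)."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical example: a 2,000-token chat at $5 per million tokens.
budget = estimated_tokens(2_000)
per_request = {name: cost_usd(t, 5.0) for name, t in budget.items()}
```

With those inputs the multi-agent run uses 30,000 tokens, or about $0.15 per request against $0.01 for the chat baseline. The useful comparison is usually multi-agent against single-agent, which is 15/4 = 3.75× by the same numbers.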

Specialist count has a sweet spot. Below 3 specialists, the supervisor’s routing decision is trivial and a Python if/elif would do the job. Above 10, the supervisor’s tool list overflows its decision capacity and a router-of-routers hierarchy starts to look attractive. The middle range is where supervisor patterns earn their keep.

Subagent fan-out is a fork-bomb risk. Anthropic flagged a failure mode where their early system spawned 50+ subagents for simple queries. Two mitigations: a hard cap on subagent count enforced by the harness, and explicit prompt instructions in the supervisor telling it the right fan-out for the task class (“1 subagent for fact lookups, 2–4 for direct comparisons, 10+ only for breadth-first research”). The prompt is the fast feedback loop; the cap is the safety net.
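
The safety-net half of that pairing can be a few lines in the harness. This is a sketch with assumed values: the task-class names, the ceiling of 20 for breadth-first research, and the HARD_CAP constant are illustrative choices, not numbers from Anthropic’s writeup.

```python
# Fan-out guidance from the supervisor prompt, mirrored in harness code.
# Task-class names and the upper limits are assumptions for illustration.
FAN_OUT_LIMITS = {
    "fact_lookup": 1,
    "comparison": 4,
    "breadth_first_research": 20,
}
HARD_CAP = 20


def clamp_fan_out(task_class: str, requested: int) -> int:
    """Return how many subagents the harness will actually spawn."""
    # Unknown task classes get the most conservative limit.
    limit = min(FAN_OUT_LIMITS.get(task_class, 1), HARD_CAP)
    return max(1, min(requested, limit))
```

If the supervisor asks for 50 subagents on a fact lookup, the harness spawns one. Logging every clamp is worthwhile, because a high clamp rate says the prompt guidance is not landing.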

Workers can deadlock the supervisor. If a worker emits a handoff back to the supervisor that emits a handoff to the same worker, the system can ping-pong indefinitely with the model “reasoning differently each time” but making no progress. The same no-progress detector from the agent loop article applies: track the last N states, terminate if they’re effectively identical. In multi-agent, the comparison is harder because state lives in many places, but the principle holds.
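
A minimal version of that detector, assuming the harness can produce a JSON-serializable snapshot of task-level state after each turn. The NoProgressDetector class and the snapshot fields in the usage note are hypothetical.

```python
import hashlib
import json
from collections import deque


class NoProgressDetector:
    """Flags a loop when the last `window` state fingerprints are identical."""

    def __init__(self, window: int = 3):
        self.fingerprints = deque(maxlen=window)

    def observe(self, state: dict) -> bool:
        """Record a state snapshot; return True if the system appears stuck."""
        # Canonical JSON so that key order does not change the fingerprint.
        canonical = json.dumps(state, sort_keys=True, default=str)
        self.fingerprints.append(hashlib.sha256(canonical.encode()).hexdigest())
        return (len(self.fingerprints) == self.fingerprints.maxlen
                and len(set(self.fingerprints)) == 1)
```

What goes into the snapshot decides whether this works. Include things that change only on real progress, such as open questions and facts collected, and leave out the active agent’s name and raw model text. Otherwise a ping-pong between two agents produces alternating fingerprints and never trips the check.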

Context leakage and prompt injection cross trust boundaries. When you delegate to a third-party agent over A2A, you have to assume that agent can be compromised — by a malicious user, by an injection embedded in its tool outputs, by a poisoned model. Any data passed to it should be treated as data passed to an untrusted external service. The supervisor pattern’s centralized routing makes this easier to reason about than swarm handoffs — only the supervisor’s prompt needs to defend against injection-induced re-routing — but the cross-boundary issue doesn’t go away.

Swarm loops are harder to debug than supervisor loops. When agent A hands off to B which hands back to A, your trace looks like a graph traversal, not a linear log. Most LLM observability tools (Langfuse, LangSmith, Phoenix) handle this now, but the cost is that your debugger needs to be the trace viewer; print() won’t cut it. Multi-agent systems are the first agent setup where production observability is required, not optional.

The “lead agent” can become a single point of failure for cost. Every routing decision pays the lead agent’s per-call cost. If the lead is on the largest, most expensive model, your cost graph grows linearly with task complexity even when most of the work is happening in cheap workers. Two mitigations: (a) downgrade the lead model to a mid-tier model once the system is stable (Anthropic and the LangGraph community both report ~35% cost reduction with ~4% accuracy hit from this); (b) cache the lead’s system prompt aggressively with prompt caching so the per-step cost is dominated by output tokens rather than re-prefilling the routing rules.

Memory across agents is its own problem. When one subagent learns something that another subagent needs, the supervisor pattern routes it through the lead’s context; the swarm pattern depends on the handoff payload; the hierarchical pattern depends on the shared state object. Multi-agent shared memory is the dedicated deep dive — it’s the most common reason a working two-agent system fails when you scale to ten, and the consistency questions (when does a write become visible, how are concurrent writes resolved, what’s the audit trail, what’s the deletion path) have to be answered explicitly rather than left to chance.

A2A is still maturing. The protocol hit v1.0 in early 2026 and has broad industry support, but the SDK landscape is uneven. Use A2A when you genuinely need to talk to an agent across a trust boundary; use native LangGraph/Agents-SDK constructs when everything lives inside your stack. Cross-vendor multi-agent is real but it is not the default architecture.

Further reading

  • Anthropic — “How we built our multi-agent research system” — the engineering writeup behind Claude’s deep-research product. Honest about the cost (15× chat tokens), specific about the win (90% over single-agent on internal evals), and the cleanest single source on what actually breaks when you scale a supervisor-with-subagents system in production.
  • Anthropic — “Building effective agents” — the December 2024 piece that drew the workflow-vs-agent line and named the five orthogonal patterns (prompt-chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer). Multi-agent is “orchestrator-workers” in their taxonomy; this article zooms in on that pattern.
  • A2A Protocol specification — the canonical reference for Agent Cards, Tasks, and the wire format. Read this before you commit to a cross-agent integration; the protocol is small enough to internalize in an afternoon and stable enough at v1.0 to bet on.
  • LangGraph Supervisor library — the reference Python implementation of the supervisor pattern, with worked examples for handoffs, message history filtering, and hierarchical (supervisor-of-supervisors) systems. The clearest “look at the code” entry point for the pattern.
  • Planning Agents vs Reactive Agents — the single-agent control-flow piece this article builds on. Multi-agent is the next knob to turn after the planning/reacting decision; many “we need multi-agent” requirements are actually “we picked the wrong control flow.”
  • The Agent Loop: ReAct and Its Descendants — the loop body each agent in the system runs. Multi-agent inherits all of the single-loop failure modes (stopping conditions, retries, idempotency) and adds inter-agent ones on top.
  • Tool Selection at Scale: MCP and Dynamic Routing — the intra-agent version of the routing problem this article solves inter-agent. The hierarchical-pattern flag at the start of the article (“router-style tool selection layer is simpler than nested supervisors”) is what that piece is about.
  • Multi-Agent Shared Memory — the layer this article kept deferring. Shared state across agents is the place where the supervisor/swarm/hierarchical patterns hit their consistency limits; the four shared-memory patterns (supervisor-mediated, shared-block, cross-thread store, blackboard), the four consistency questions, and the CRDT vs CAS vs last-writer-wins trade-offs are the dedicated deep dive.