Multi-Agent Orchestration Patterns and the Real Cost of Coordination

The instinct when a multi-agent system underperforms is to add more agents. Another specialist. Another reviewer. Another critic. The instinct is almost always wrong.

There is now enough production data and rigorous benchmarking to say this with confidence: agent count is the wrong variable. Orchestration topology — the structure that connects agents, controls message flow, and manages who makes decisions — is the primary performance lever in any multi-agent system once models reach comparable capability levels. And the coordination overhead of getting this wrong is measured in concrete, uncomfortable numbers.


The Failure Rate Nobody Talks About

The most striking number in the 2025 multi-agent literature comes from the MAST paper (ICLR 2025), which analyzed over 1,600 execution traces across seven popular multi-agent frameworks — ChatDev, AutoGen, MetaGPT, CrewAI, and others. ChatDev’s correctness rate in production was 25%. Not 70%. Not 80%. One in four.

Multi-agent systems are being deployed in production with frameworks that fail three quarters of the time. Understanding why requires looking at where the failures actually come from. MAST's taxonomy attributes them to three categories:

Specification and system design failures: 44.2%
Inter-agent misalignment (handoff and coordination failures): 32.3%
Task verification and termination failures: 23.5%

What's revealing is what isn't on this list: raw capability failures. The systems aren't failing because the underlying model isn't smart enough. They're failing because of how agents are connected and instructed. Better prompts improve things by at most 14%. Specification and design problems require architectural changes.


When More Agents Actually Makes Things Worse

Google published research in December 2024 (Towards a Science of Scaling Agent Systems, arXiv 2512.08296) that evaluated 180 agent configurations across five benchmark types. The finding: for sequential reasoning tasks, every multi-agent variant tested degraded performance by 39 to 70% compared to a single agent handling the task alone.

For parallel tasks — financial reasoning where sub-problems can be solved independently — centralized multi-agent coordination improved performance by 80.9% over a single agent.

The pattern is sharp. The variable that predicts which situation you're in isn't task complexity — it's task decomposability: whether the task splits into sub-problems that can be solved independently, or whether each step depends on the output of the last.

Google’s team built a predictive framework using two input variables: task decomposability and tool count. That framework correctly identified the optimal coordination strategy for 87% of unseen task configurations.


The Coordination Tax Scales Superlinearly

Coordination overhead doesn't scale linearly with agent count. Research on LLM-Coordination found it grows between O(n^1.4) and O(n^2.1). Adding your third agent costs more coordination than adding your second. Adding your fifth costs more than your fourth.

This creates a ceiling. At some number of agents, coordination cost exceeds the marginal capacity added by the new agent. The research suggests this ceiling materializes somewhere around eight to ten agents for most task configurations, but depends on how tightly coupled the agents are.
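To make the ceiling concrete, here is a toy model of the trade-off. The exponent sits inside the reported O(n^1.4) to O(n^2.1) range; the coefficients are hypothetical, chosen only to show how a crossover point emerges:

```python
# Illustrative model of the coordination ceiling (all constants hypothetical).
# Capability grows roughly linearly with agent count; coordination cost grows
# superlinearly, here with exponent 1.9 (inside the reported 1.4-2.1 range).

def net_benefit(n: int, gain_per_agent: float = 1.0,
                coord_coeff: float = 0.08, exponent: float = 1.9) -> float:
    """Capability gained minus coordination overhead for n agents."""
    return gain_per_agent * n - coord_coeff * n ** exponent

# Find the agent count where adding one more agent stops paying for itself.
ceiling = next(n for n in range(1, 50)
               if net_benefit(n + 1) <= net_benefit(n))
print(ceiling)  # → 8
```

With these particular constants the crossover lands at eight agents, consistent with the eight-to-ten range above; in a real system the coefficients would need to be measured, not assumed.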

The practical rule: Keep total agent count at 4-5 for a 24/7 system. Each agent should have a non-overlapping file domain, clear termination conditions, and async communication through shared files rather than synchronous calls. At this scale, coordination overhead is below the capability gain threshold.

The token economics confirm this. Anthropic’s production data sets a clear baseline: a single chat interaction is 1x. A single agent with tool calls is 4x. A multi-agent system is 15x. Multi-agent is only economically viable for tasks where the value delivered exceeds 15x the cost of an equivalent chat interaction.


The Five Orchestration Topologies

The literature has converged on five canonical patterns, each with distinct error characteristics, cost structures, and optimal task classes.

1. Hub-and-Spoke (Centralized Supervisor) — Best Reliability

A central orchestrator delegates tasks to specialized workers and synthesizes their outputs. All inter-agent communication routes through the hub.

Error amplification: 4.4x — the hub acts as a circuit breaker, catching errors before they propagate. Compare this to 17.2x for unstructured networks.

Best for: Tasks where errors must not cascade; when workflow structure is clear; when a single authoritative synthesis is required.
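A minimal sketch of the pattern, assuming a simple dict-based message format (the worker names and validation rule here are illustrative):

```python
# Hub-and-spoke sketch: the hub is the only channel between workers.
# It validates each worker's output before anything downstream can build
# on it, which is what gives the topology its circuit-breaker behavior.

from typing import Callable

Worker = Callable[[str], dict]

def hub(task: str, workers: dict[str, Worker]) -> dict:
    results = {}
    for name, worker in workers.items():
        out = worker(task)
        # Validation at the hub: reject malformed output instead of
        # silently passing it on to the next agent.
        if not isinstance(out, dict) or out.get("status") != "success":
            results[name] = {"status": "rejected", "reason": "failed validation"}
            continue
        results[name] = out
    # Single authoritative synthesis happens at the hub.
    return {"task": task, "results": results}

def searcher(task: str) -> dict:
    return {"status": "success", "findings": f"results for {task!r}"}

def broken_worker(task: str) -> dict:
    return {"oops": True}  # missing status field, caught at the hub

report = hub("survey orchestration papers",
             {"search": searcher, "bad": broken_worker})
print(report["results"]["bad"]["status"])  # → rejected
```

The point of the sketch is structural: every result passes through one validation chokepoint, so a broken worker is contained rather than amplified.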

2. Pipeline (Sequential) — Best for Reasoning

Agents execute in strict order, each agent consuming the prior agent’s full output. No parallelism. Appears inefficient but is optimal for a large class of tasks.

AdaptOrch selects sequential topology 41% of the time on GPQA Diamond tasks. For reasoning-heavy problems, sequential beats parallel because reasoning requires compounding context across steps — parallelism fragments the chain.

Best for: Sequential reasoning, multi-step analysis, tasks with strict dependencies between steps.

3. Hierarchical (Nested Orchestration) — Best for Complex Tasks

Sub-orchestrators manage clusters of workers; a meta-orchestrator coordinates sub-orchestrators. Requires careful design but scales to genuinely complex workflows.

AgentOrchestra achieved 89.04% on GAIA using hierarchical architecture with a central planner delegating to specialized sub-agents for web search, data analysis, and file operations.

Best for: Complex research tasks, workflows with distinct functional domains, tasks where 3+ specialized capabilities must coordinate.

4. Independent / Bag-of-Agents — Worst Reliability

Parallel agents work independently on sub-tasks and aggregate at the end. No inter-agent communication. Appears efficient and is the default choice for many teams.

Error amplification: 17.2x — worse than every other topology. When tasks are not truly decomposable, independent agents compound each other’s mistakes at scale.

Best for: Genuinely decomposable tasks with no inter-dependencies. Rarely the right choice in practice.

5. Event-Driven Mesh — Best Resilience

Agents communicate via a message queue (Kafka, NATS, etc.) rather than directly. No agent couples to another — agents subscribe to topics and publish results.

Best for: Long-running workflows where agent failures should not abort the entire pipeline; auditable, replayable agent interactions; compliance-sensitive environments.


The Error Amplification Problem

When multiple agents operate independently — without a central orchestrator validating their work — errors amplify at a rate of 17.2x through unchecked propagation. When a central orchestrator reads sub-agent outputs and validates before passing results downstream, this drops to 4.4x.

This connects to a key observation: if a sub-agent writes a subtly wrong file to the shared state — a malformed sitemap entry, an incorrect article link — that error propagates silently into every subsequent decision made from that state. No alarm fires. The main loop reads the file, treats it as correct, and builds on it. The error compounds.

Centralization isn’t bureaucracy in this context. It’s error containment. The orchestrator reading and validating sub-agent output is what drops the amplification factor from 17x to 4x.


The Topology Selection Framework

Given the research, a practical topology selection framework:

| Task Type | Optimal Topology | Avoid | Evidence |
|---|---|---|---|
| Sequential reasoning (logic chains, analysis) | Sequential / single agent | Parallel / independent | AdaptOrch: 41% of GPQA cases; 39-70% degradation with parallel |
| Independent decomposable sub-tasks | Hub-and-spoke | Bag-of-agents | Google: 4.4x vs 17.2x error amplification |
| Complex research / multi-domain coordination | Hierarchical | Flat centralized | AgentOrchestra: 89.04% GAIA |
| Long-running workflows with failure tolerance | Event-driven mesh | Direct agent coupling | Durable message storage survives agent crashes |
| Latency-sensitive (P95 < 6s) | Single agent or minimal hub-and-spoke | Deep hierarchical pipelines | 38-46% critical path reduction with explicit latency supervision |

AdaptOrch (arXiv:2602.16873, UC Berkeley) builds a formal framework for dynamically selecting among these topologies based on a task dependency graph. Using identical underlying models, topology-aware routing outperforms static single-topology baselines by +22.9% on SWE-bench Verified and +14.9% on GPQA Diamond. The same models. A different connection structure.


Structured Handoffs: The API vs. Memo Distinction

The MAST paper’s inter-agent misalignment category (32.3% of failures) is almost entirely a handoff problem. The most common failure is conversation reset: an agent receives a handoff, resets its context, and starts from scratch — ignoring the handoff state entirely. This happens most often when handoffs are transmitted as free-text summaries.

The analogy that clarifies this: inter-agent handoffs should be treated like a public API, not like a memo. A memo is prose — the reader interprets it, prioritizes it, can ignore parts. An API call is structured — the receiving process parses defined fields and fails explicitly if required fields are missing.

When you pass free-text summaries between agents, you get memo semantics. When you pass structured output — key/value pairs, status codes, explicit “next action” fields — you get API semantics. An outbox entry that says “wrote article X, status: success, path: /var/www/blog/X.html” is more reliable than a paragraph describing the same.
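A sketch of what API semantics look like in practice, using a frozen dataclass as the handoff schema (the field names are illustrative, not a standard):

```python
# API-style handoff sketch: a schema with required fields fails loudly
# at the boundary, unlike a free-text summary the receiving agent can
# silently ignore or reinterpret.

from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    task_id: str
    status: str          # e.g. "success" | "failure"
    artifact_path: str   # where the produced artifact lives
    next_action: str     # explicit instruction for the receiving agent

def parse_handoff(payload: dict) -> Handoff:
    # Missing keys raise KeyError here, at the handoff boundary,
    # instead of surfacing later as a silently reset conversation.
    return Handoff(
        task_id=payload["task_id"],
        status=payload["status"],
        artifact_path=payload["artifact_path"],
        next_action=payload["next_action"],
    )

ok = parse_handoff({
    "task_id": "article-42",
    "status": "success",
    "artifact_path": "/var/www/blog/X.html",
    "next_action": "update sitemap",
})
print(ok.next_action)  # → update sitemap

try:
    parse_handoff({"task_id": "article-43"})  # memo-like, underspecified
except KeyError as missing:
    print(f"rejected handoff, missing field: {missing}")
```

The underspecified handoff is rejected at parse time, which is exactly the "fails explicitly if required fields are missing" property the API analogy calls for.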


File-Based State Is the Right Pattern

Anthropic’s engineering guidance on long-running agent architecture validates a specific design: an initializer agent creates a progress file and commits it. A coding agent runs each subsequent session by reading git logs and the progress file, executes one bounded task, commits with a descriptive message, and updates the progress file. Git history serves as both versioned state and rollback capability.

The reason this works is structural alignment: LLMs are trained on developer workflows. They’re unusually competent at reading files, following directory structures, grepping patterns. Using a filesystem as shared state isn’t a workaround for something that should be a database — it’s playing to a genuine strength in how these models were trained.

The critical design decision within file-based state: append-only logs for anything that multiple agents might write concurrently, and explicit ownership for anything with single-writer semantics. The failure mode is “silent last-write-wins” — two agents both writing to the same file, with the later write overwriting the earlier one without either agent knowing. Each agent should own a specific directory, with shared writes limited to append-only log files.
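A minimal sketch of this ownership scheme, with hypothetical agent names, directories, and event fields:

```python
# File-based shared state sketch: each agent owns one directory for
# exclusive writes; cross-agent communication goes through an append-only
# log, which avoids silent last-write-wins on shared files.

import json, os, time

STATE_ROOT = "state"
OWNERSHIP = {"researcher": "research", "writer": "drafts"}  # agent -> owned dir

def write_owned(agent: str, filename: str, content: str) -> None:
    path = os.path.join(STATE_ROOT, OWNERSHIP[agent], filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:   # overwrite is safe: single writer by design
        f.write(content)

def append_event(agent: str, event: dict) -> None:
    # Shared file, but append-only: concurrent writers add lines, never clobber.
    os.makedirs(STATE_ROOT, exist_ok=True)
    record = {"ts": time.time(), "agent": agent, **event}
    with open(os.path.join(STATE_ROOT, "events.log"), "a") as f:
        f.write(json.dumps(record) + "\n")

write_owned("writer", "draft-1.md", "# Draft\n")
append_event("writer", {"action": "wrote", "file": "drafts/draft-1.md"})
append_event("researcher", {"action": "found", "url": "example.com"})
```

Overwrites are confined to single-writer directories; the only multi-writer file is the log, and appends from different agents interleave instead of destroying each other.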


Four Orchestration Failure Modes to Avoid

Deadlock

Circular dependency: Agent A waits on B, B waits on C, C waits on A. No error signal is generated — the system silently stalls. The practical solution is designing for acyclic dependency graphs and imposing timeout thresholds on any agent-to-agent wait.
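A minimal timeout guard for that wait, with an illustrative threshold:

```python
# Timeout on any agent-to-agent wait: a dependency cycle then fails
# loudly with an error instead of stalling silently. The 30s default
# is illustrative; tune it to the workload.

import time

def wait_for(ready, timeout_s: float = 30.0, poll_s: float = 0.5) -> bool:
    """Poll `ready()` until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if ready():
            return True
        time.sleep(poll_s)
    raise TimeoutError("agent wait exceeded threshold; possible dependency cycle")
```

Wrapping every inter-agent wait in a guard like this converts a silent stall into an explicit error signal the orchestrator can act on.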

Cascade / Retry Storm

One downstream failure triggers simultaneous retries in multiple upstream agents. The fix is the same pattern used in microservices: bulkhead isolation + exponential backoff on retry chains. Never use “fire and forget” orchestration without circuit breakers.
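A sketch of the backoff half of that fix, using full jitter (attempt count and delays are illustrative):

```python
# Exponential backoff with full jitter on a retry chain. Bounding attempts
# and randomizing delays keeps several upstream agents from retrying a
# failed downstream agent in lockstep (the retry-storm pattern).

import random, time

def call_with_backoff(fn, max_attempts: int = 5,
                      base_s: float = 0.5, cap_s: float = 30.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up; the caller's circuit breaker takes over
            # Full jitter: sleep a random fraction of the exponential cap.
            time.sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))
```

The jitter is the part that prevents the storm: without it, every upstream agent that saw the same failure retries at the same instant.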

Context Saturation in Sequential Pipelines

In long sequential pipelines, intermediate results accumulate in the orchestrator's context window. Early context falls into the attention "lost in the middle" zone and is effectively discarded. Treat context saturation as a hard pipeline depth constraint, not a soft limit.

History Loss at Handoffs

MAST specifically identifies loss of conversation history and conversation reset as appearing in 37% of inter-agent misalignment failures. When Agent B receives a task from Agent A, if it does not receive the full context of what A decided and why, B operates on incomplete information and produces locally-correct but globally-wrong outputs. Treat agent handoffs as state transfer operations, not just task delegation.


What Production Systems Actually Do

The pattern across Anthropic’s published architecture notes, Google’s Agent Development Kit documentation, and production research is consistent:

Embed scaling rules in prompts, not code. Tell each agent explicitly what effort level is appropriate for what task class. A practical scaling rule: simple fact-check = 1 agent with 3-10 tool calls; direct comparison = 2-4 subagents; complex research = 10+ subagents.
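One way to sketch this, embedding the rule above into a hypothetical orchestrator's system prompt (the wording is illustrative):

```python
# Effort-scaling rules live in the prompt, not in routing code, so the
# lead agent itself classifies the task and scales delegation accordingly.
# Thresholds mirror the rule stated above; phrasing is illustrative.

SCALING_RULES = """\
Before delegating, classify the task and scale effort accordingly:
- Simple fact-check: handle it yourself with 3-10 tool calls. Do not spawn subagents.
- Direct comparison: spawn 2-4 subagents, one per item being compared.
- Complex research: spawn 10+ subagents, each with a non-overlapping scope.
State your classification before acting."""

def orchestrator_prompt(task: str) -> str:
    return f"You are the lead agent.\n\n{SCALING_RULES}\n\nTask: {task}"
```

Keeping the thresholds in prompt text means they can be tuned per deployment without touching orchestration code.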

Instrument boundaries, not execution. The useful observability is at handoffs: what did Agent A decide, what did it hand to Agent B, and what did B produce from it? The internal reasoning trace of each agent is less valuable and generates PII risk. Treat delegation as “transfer of authority” — a distinct event to log.

Agents as filters, not generators. Subagents should return ranked result lists to the lead agent rather than synthesized outputs. The orchestrator, with full context, makes the synthesis decision. This preserves optionality and reduces the risk of a subagent producing an authoritative-sounding wrong answer that the orchestrator accepts uncritically.


Frequently Asked Questions

What is the coordination tax in multi-agent AI systems? The coordination tax is the performance overhead from agent-to-agent communication, handoffs, and synchronization. Research shows it scales superlinearly — O(n^1.4 to n^2.1). Adding your fifth agent costs proportionally more coordination than adding your second. Beyond 8-10 tightly coupled agents, coordination cost typically exceeds the marginal capability gained.

Why do multi-agent AI systems fail so often in production? The MAST paper (ICLR 2025) analyzed 1,600+ execution traces and found: design failures (44.2%), inter-agent handoff misalignment (32.3%), task verification failures (23.5%). Fewer than 0.5% of failures were raw model capability failures. The systems aren’t failing because the models are insufficient — they’re failing because of how agents are connected and instructed.

When does adding more AI agents hurt performance? For sequential tasks where each step depends on the previous output, every multi-agent variant in Google’s December 2024 study degraded performance by 39-70% vs. a single agent. For parallelizable tasks with independent sub-problems, multi-agent improved performance by up to 81%. Task decomposability — not complexity — is the determining variable.

How do you reduce coordination overhead in a multi-agent system? Four concrete changes: (1) Keep agent count to 4-5 for 24/7 systems. (2) Replace free-text handoffs with structured formats — key/value pairs, explicit next-action fields. (3) Add a central orchestrator to validate sub-agent outputs before passing them downstream (reduces error amplification from 17.2x to 4.4x). (4) Use file-based state with explicit ownership — each agent owns specific directories, shared writes are append-only logs only.

What is the optimal number of agents in a production AI system? 4-5 agents is the practical ceiling for most 24/7 systems. Each agent should handle a non-overlapping domain, communicate asynchronously through shared files, and have explicit termination conditions. Above 8-10 agents, you’re almost certainly paying more in coordination overhead than you’re gaining in parallel capability.


Sources

  1. Zhan et al., “MAST: Towards Multi-Agent System Taxonomies” (ICLR 2025) — failure taxonomy across 1,600+ execution traces in seven frameworks
  2. Google DeepMind, “Towards a Science of Scaling Agent Systems” (arXiv:2512.08296, December 2024) — 180 configurations across 5 benchmark types
  3. AdaptOrch team, “Adaptive Orchestration for Multi-Agent LLM Systems” (arXiv:2602.16873, UC Berkeley) — topology selection outperforming static baselines by +22.9% on SWE-bench
