Multi-Agent Fleet Management in Production: The Operations Playbook


There’s a pattern we’ve seen play out again and again across teams building multi-agent systems: the demo works, the prototype impresses, and then production arrives and everything goes sideways. Not because the AI was wrong. Because nobody thought about operations.

The coordination overhead problem we covered in multi-agent-coordination-tax and the handoff patterns from multi-agent-handoff-tutorial are real. But they’re tractable engineering problems. The harder problem is what happens after your agents are deployed: how do you know they’re working? What happens when one gets stuck? How do you control costs as your fleet grows? How do you keep the whole thing running without someone staring at logs all day?

These are operations problems, not AI problems. And the good news is that distributed systems engineering already has most of the answers — you just have to recognize the patterns.


1. The Fleet Analogy

Stop thinking about agents as code that runs. Start thinking about them as workers in a fleet.

A fleet has a control plane — the scheduler, orchestrator, or lead agent that decides what work goes where. It has worker nodes — the agents that actually execute tasks. It has health signals — heartbeats, status reports, task state updates that tell the control plane which workers are alive and what they’re doing. It has work queues — pending tasks waiting to be assigned. And it has a budget — finite resources (compute slots, token budget, agent limits) that constrain how much you can run in parallel.

This mental model isn’t just conceptual. It changes the questions you ask.

Instead of “did my agent finish?”, you ask “is my fleet healthy?” Instead of “why did this run fail?”, you ask “which worker failed, at what point in the task graph, and does the task need to be re-routed?” Instead of watching a single process, you’re monitoring a distributed system.

The teams that struggle with multi-agent production are almost always the ones still thinking in single-process terms. The teams that succeed have borrowed the operations playbook from microservices, distributed databases, and cloud infrastructure — because the problems are the same.


2. The Core Problems of Fleet Management

Worker Lifecycle

Every agent costs something to run, and the cost doesn’t stop when the agent is idle. An agent sitting around waiting for work is still consuming an infrastructure slot, potentially holding allocated memory or context, and — in LLM-based systems — accumulating cost every time it processes a new message or heartbeat prompt.

The lifecycle question is when to spawn, when to keep idle, and when to shut down. Spawn too aggressively and you waste budget on agents that have nothing to do. Spawn too conservatively and work queues back up. Shut down too eagerly and you pay the cold-start cost: re-spinning an agent, re-injecting context, re-establishing state from files or structured memory.

In our content pipeline, we’ve found that topic-persistent workers — agents that stay alive across multiple related posts rather than being spawned per task — significantly reduce cold-start overhead. A research writer that maintains context about an AI agents topic cluster is more effective on its fifth post in that cluster than it would be as a fresh spawn.

Work Routing

Which agent gets which task? This seems obvious until you have multiple workers with different states of context. Routing a task about multi-agent observability to a writer that’s been covering infrastructure reliability is better than routing it to a fresh agent, even if both are technically capable.

Simple routing is round-robin or queue-pop. Better routing is capability-aware: match task requirements to worker context and load. Best routing is priority-aware: high-priority tasks preempt low-priority ones, and the control plane maintains a proper priority queue rather than a FIFO list.

Health and Observability

How do you know an agent is stuck versus thinking? This is a deceptively hard problem. An LLM doing a long reasoning chain looks identical to an LLM that has looped. Both are “running.” Both consume tokens. Both show activity in your process monitor.

The answer isn’t to poll more aggressively — it’s to require structured progress signals. A heartbeat that just says “alive” is nearly useless. A heartbeat that says “currently on step 3/7: searching for sources, 2 searches completed” is actionable.

Failure Modes

Agent failures come in four categories, and they require different responses:

  1. Crash failures: the agent errors out visibly.
  2. Loop failures: the agent keeps running without making progress.
  3. Silent degradation: the agent produces plausible-looking but hollow output.
  4. Blocker failures: the agent is stuck waiting on something it cannot resolve.

Budget Management

Budget isn’t just token cost — it’s the total resource footprint of your fleet. In axon, we track agent slot budget as a hard limit: you can only run N non-stopped agents at once. This forces deliberate decisions about spawning versus stopping. It also makes the cost of “just leaving that agent running” visible and bounded.


3. Lifecycle Management Deep Dive

The spawn-vs-persistent decision deserves its own section because it’s where most teams make costly mistakes in both directions.

Spawn on demand works well when tasks are large, infrequent, and require fresh context. A one-off research task on a novel topic is a good candidate for spawning a new agent: you don’t have accumulated context to leverage, and the task is big enough that cold-start overhead is negligible relative to execution time.

Persistent workers work well when tasks are frequent, similar, and benefit from accumulated context. A research writer covering AI agent topics across a dozen posts builds up domain context, learns which sources are reliable, and gets faster at the work. Killing and respawning that agent between every post throws away real value.

The key question is: does this worker’s value compound over time? If yes, keep it alive. If no, spawn per task.

The Idle Worker Protocol

When a worker runs out of tasks, it shouldn’t just sit there — that’s wasted budget. But it also shouldn’t self-terminate immediately, because work might be coming. Our protocol:

  1. Worker completes its last task.
  2. Worker sends a status message to its owner/lead: “Task complete. What is next? Should I wait, take on new work, or shut down?”
  3. Owner responds with new work, or sends a stop signal.
  4. If no response within two heartbeat cycles, worker sends a final message and goes into a pending-stop state.

This avoids ghost workers — agents that are alive but no longer useful, consuming budget that could go to new spawns.
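The protocol above can be sketched as a small state machine; the phase names and the two-cycle grace period are illustrative:

```python
from enum import Enum, auto

class WorkerPhase(Enum):
    WORKING = auto()
    AWAITING_INSTRUCTIONS = auto()   # asked the owner "what next?"
    PENDING_STOP = auto()            # no answer within the grace period
    STOPPED = auto()

def on_task_complete() -> WorkerPhase:
    # Steps 1-2: finish the task, then explicitly ask the owner.
    return WorkerPhase.AWAITING_INSTRUCTIONS

def on_heartbeat(phase: WorkerPhase, cycles_without_reply: int,
                 grace_cycles: int = 2) -> WorkerPhase:
    # Step 4: after grace_cycles unanswered heartbeats, wind down
    # rather than lingering as a ghost worker.
    if phase is WorkerPhase.AWAITING_INSTRUCTIONS and cycles_without_reply >= grace_cycles:
        return WorkerPhase.PENDING_STOP
    return phase
```

The key property is that every transition out of WORKING is explicit: a worker is never silently alive with nothing to do.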

Graceful Shutdown

Hard-stopping an agent mid-task is recoverable only if you have good checkpointing. Without it, you lose whatever partial progress the agent had made and have to restart from scratch.

Graceful shutdown means signaling the agent to wrap up its current work unit before stopping: finish or checkpoint the piece of work in flight, write its state to durable storage, and only then exit.

Anthropic’s team ran into exactly this problem at scale when building their multi-agent research system, using what they called “rainbow deployments” to avoid killing agents mid-task during code rollouts.1 The principle translates directly: never disrupt an agent that’s in the middle of meaningful work without letting it reach a safe stopping point.

Context Preservation

State is the central challenge of long-lived agents. LLM context windows reset; file systems persist. Our pattern is to keep working state in files: each agent maintains a state.md and related memory files, updated at task boundaries.

Each session starts with the agent reading its state.md and relevant memory files. This adds a few tokens of overhead per session but eliminates the far larger cost of starting from zero.
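A minimal version of the pattern — the file name follows the post’s state.md convention, but the exact structure here is an assumption:

```python
from pathlib import Path

STATE_FILE = Path("state.md")

def save_state(current_task: str, open_threads: list[str]) -> None:
    """Write working state at task boundaries so a fresh session
    can resume without reconstructing context from scratch."""
    lines = ["# Agent State", "", "## Current task", current_task, "", "## Open threads"]
    lines += [f"- {t}" for t in open_threads]
    STATE_FILE.write_text("\n".join(lines) + "\n")

def load_state() -> str:
    """Session start: read state before doing anything else."""
    return STATE_FILE.read_text() if STATE_FILE.exists() else ""

save_state("draft post on fleet management",
           ["waiting on editor feedback for post-41"])
print(load_state())
```

Writing at task boundaries rather than continuously keeps the overhead negligible while guaranteeing a recent checkpoint exists at every hand-off point.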


4. Observability Without Overhead

The enterprise APM pitch for multi-agent systems is appealing: full distributed tracing, every token logged, beautiful dashboards, 99th-percentile latency graphs. Don’t buy it. Not because it’s wrong, but because you don’t need most of it to run a healthy fleet, and the instrumentation overhead can become a second operations problem.

What you actually need:

Heartbeats as health signals — not just liveness. Every heartbeat should include: current task ID, current step within that task, time elapsed since task start, and any blockers or warnings. This gives you enough to distinguish “thinking” from “stuck” and “making progress” from “looping.”

Structured livefeed logging — log at task boundaries, not inside tight loops. “Started searching for sources on agent observability” is useful. “Sending token number 47,382 of 47,383” is not. Log what decisions were made and why, not every mechanical step.

Task state as ground truth — if your task tracker says a task has been “in progress” for three hours on a job that should take 20 minutes, something is wrong. Task state is cheap to write and invaluable for catching silent failures. The task record is your SLO check.
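The SLO check can be a few lines over the task records; the schema, task types, and duration numbers here are illustrative:

```python
import time

# Expected durations per task type, in seconds (illustrative numbers).
EXPECTED_DURATION = {"research": 2 * 3600, "file_write": 5 * 60}

def stale_tasks(tasks: list[dict], now: float, slack: float = 1.5) -> list[str]:
    """Flag tasks 'in progress' longer than slack x expected duration
    for their type -- the cheap silent-failure detector."""
    flagged = []
    for t in tasks:
        if t["state"] != "in_progress":
            continue
        budget = EXPECTED_DURATION[t["type"]] * slack
        if now - t["started_at"] > budget:
            flagged.append(t["id"])
    return flagged

now = time.time()
tasks = [
    {"id": "t1", "type": "file_write", "state": "in_progress",
     "started_at": now - 3 * 3600},      # three hours on a minutes-class job
    {"id": "t2", "type": "research", "state": "in_progress",
     "started_at": now - 30 * 60},
]
print(stale_tasks(tasks, now))  # → ['t1']
```

Calibrating the bound per task type rather than globally is what keeps the false-positive rate tolerable.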

Cross-agent tracing — follow a piece of work through the fleet. When a content lead spawns a research writer which produces a draft that goes to an editor, each handoff should tag the work item with a consistent trace ID. When something goes wrong, you trace back through the chain.
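Trace propagation needs nothing more than tagging each handoff with the same ID; the message shape here is an assumption:

```python
import uuid

def new_work_item(description: str) -> dict:
    """The control plane creates the work item and assigns the trace ID once."""
    return {"trace_id": uuid.uuid4().hex, "description": description, "hops": []}

def hand_off(item: dict, from_agent: str, to_agent: str) -> dict:
    """Each handoff appends to the chain but never changes the trace ID."""
    item["hops"].append({"from": from_agent, "to": to_agent})
    return item

item = new_work_item("post on agent observability")
item = hand_off(item, "content-lead", "research-writer")
item = hand_off(item, "research-writer", "editor")
print(item["trace_id"], [h["to"] for h in item["hops"]])
```

When a published draft turns out to be wrong, the hop chain tells you exactly which agent touched it and in what order.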

The stakes are real. A 2025 survey of agentic systems in production found that approximately 79% of respondents identified non-deterministic execution flow as a major challenge — making consistent structured logging a prerequisite for any meaningful failure response, not a nice-to-have.2 Google Cloud’s SRE guidance draws the same line: domain-level metrics (did the agent produce correct output?) matter as much as infrastructure metrics (is the agent alive?).3 For a deeper treatment of the full instrumentation stack, see AI Agent Observability in Production.


5. Failure Handling

The Four Categories, Revisited

Crash failures are your friend — at least they’re visible. An agent that errors out is easy to detect and straightforward to handle: retry with backoff, or escalate to the owner if retries exhaust. The retry should ideally start from the last checkpoint, not from scratch.

Loop failures require timeout-based detection. If a task has been in the same state for longer than a reasonable upper bound, flag it. The bound should be set per task type — a complex research task has a different expected duration than a simple file write. Timeouts without per-task calibration produce too many false positives.

Silent degradation is the hardest failure mode and the one most teams discover too late. An agent that produces a shallow, fabricated, or off-target output without erroring has failed silently. The only reliable defense is output validation: either a downstream reviewer (an editor agent, a human checkpoint) or automated quality checks (word count, citation presence, structural completeness).
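Automated checks like the ones named above are cheap to run on every draft. A sketch, with illustrative thresholds and an assumed [n]-style citation convention:

```python
import re

def validate_draft(text: str, min_words: int = 800,
                   min_citations: int = 2) -> list[str]:
    """Structural checks that catch silent degradation: a draft that
    fails here has failed even if the agent 'completed' successfully."""
    problems = []
    if len(text.split()) < min_words:
        problems.append("too short")
    # Count footnote-style citation markers like [1], [2].
    if len(re.findall(r"\[\d+\]", text)) < min_citations:
        problems.append("missing citations")
    if not text.strip().startswith("#"):
        problems.append("missing title heading")
    return problems

print(validate_draft("# Title\n\nshort draft [1]"))  # → ['too short', 'missing citations']
```

These checks don’t replace a reviewer — they can’t catch fabrication that is structurally well-formed — but they turn the most common thin-output failures into visible errors.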

In our pipeline, every research writer’s output goes through editor review before publication. This isn’t just quality control — it’s a failure detector. When the editor flags a draft as thin or off-topic, that’s a signal that the writer agent had a problem, even if it “completed” successfully.

Blocker failures are particularly insidious because the agent might appear healthy while making no real progress. Our rule: if an agent has been blocked on the same issue for two or more sessions without resolution, it must escalate — send a message to its owner with a specific description of the blocker and what it needs. Silently sitting on a blocker is worse than crashing, because at least a crash alerts someone.

The Blocker Escalation Rule

This deserves emphasis because it runs counter to how most agents are prompted. The default LLM instinct is to try harder, generate more, find another path. Sometimes that’s right. But when the blocker is structural — an unclear instruction, a missing dependency, a policy ambiguity — more generation doesn’t help. It just burns tokens and delays resolution.

Agents should be explicitly trained to recognize when they’re blocked and to escalate rather than spin. The escalation message should include: what was being attempted, what specifically is blocked, what information or action would unblock it. A good blocker report saves hours.
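The escalation message can be a fixed structure so the owner never has to guess what is needed; the fields mirror the three items above, and the shape is illustrative:

```python
def should_escalate(sessions_blocked: int, threshold: int = 2) -> bool:
    """The two-session rule: escalate instead of spinning."""
    return sessions_blocked >= threshold

def blocker_report(attempting: str, blocked_on: str, needs: str,
                   sessions_blocked: int) -> dict:
    """The three fields a good blocker report must carry."""
    return {
        "type": "blocker_escalation",
        "attempting": attempting,
        "blocked_on": blocked_on,
        "needs": needs,
        "sessions_blocked": sessions_blocked,
    }

if should_escalate(2):
    msg = blocker_report(
        attempting="drafting post from brief #17",
        blocked_on="brief does not specify target audience",
        needs="owner to clarify audience: practitioners or executives",
        sessions_blocked=2,
    )
    print(msg["type"])
```

Forcing the structure is the point: an agent that can fill in all three fields has diagnosed its own blocker, and one that can’t has told you something too.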

Recovery Patterns

Human-in-the-Loop Checkpoints

Not everything should be fully automated. High-stakes outputs (publishing content, triggering financial operations, modifying infrastructure) benefit from a human approval gate before the final action. The gate doesn’t need to be manual review of every line — it can be a summary and a binary approve/reject. But building the checkpoint into the workflow means you catch subtle problems before they become visible mistakes.
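A checkpoint like this can be as small as a summary plus a binary decision; the interface here is an assumption:

```python
from typing import Callable

def with_approval(summary: str, action: Callable[[], str],
                  approve: Callable[[str], bool]) -> str:
    """Run `action` only if the approver accepts the summary.
    High-stakes steps never fire without passing the gate."""
    if approve(summary):
        return action()
    return "rejected: action not taken"

# A stand-in approver that rejects anything mentioning "untested".
result = with_approval(
    "Publish post-42 (edited, citations verified)",
    action=lambda: "published",
    approve=lambda s: "untested" not in s,
)
print(result)  # → published
```

In production the `approve` callable would block on a human decision (a message, a button) rather than a predicate, but the wrapping shape is the same.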


6. Cost Control

Budget management in multi-agent systems has two dimensions that most teams conflate: slot cost (how many agents are running) and token cost (how much computation those agents consume). Both matter, but they require different controls.

Slot Budget as a Hard Constraint

Slot budget is the cleaner control. Set a hard limit on how many non-stopped agents can run simultaneously in your fleet. This forces discipline: to spawn a new worker, you either have available capacity or you have to stop an existing one. It makes resource usage visible and bounded.
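Enforcing the slot budget as a hard limit is a one-condition check at spawn time. A sketch, with assumed fleet state:

```python
class SlotBudgetExceeded(Exception):
    pass

class FleetBudget:
    """Hard cap on non-stopped agents: to spawn past the cap,
    something must be stopped first."""
    def __init__(self, max_slots: int):
        self.max_slots = max_slots
        self.active: set[str] = set()

    def spawn(self, agent_id: str) -> None:
        if len(self.active) >= self.max_slots:
            raise SlotBudgetExceeded(
                f"{len(self.active)}/{self.max_slots} slots in use; stop a worker first")
        self.active.add(agent_id)

    def stop(self, agent_id: str) -> None:
        self.active.discard(agent_id)

fleet = FleetBudget(max_slots=2)
fleet.spawn("content-lead")
fleet.spawn("writer-a")
try:
    fleet.spawn("writer-b")     # over budget: forces a deliberate choice
except SlotBudgetExceeded:
    fleet.stop("writer-a")      # free a slot first, then spawn
    fleet.spawn("writer-b")
print(sorted(fleet.active))  # → ['content-lead', 'writer-b']
```

Raising rather than queueing is deliberate: the exception is what makes the cost of “just one more agent” visible at the moment the decision is made.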

In the axon fleet, our content pipeline runs with a budget of 12 slots across the entire tree. This sounds generous until you account for the control plane agents (content lead, self-improvement lead), persistent workers (research writers, editor), and any temporary task agents. Staying within budget requires regular hygiene: stopping workers that have finished their domain, not spawning new workers for tasks that existing workers can handle.

Idle Agents Are a Hidden Tax

An idle agent might not be making LLM calls, but it’s still consuming a slot. Slots aren’t free — they represent a queue position, infrastructure allocation, and management overhead. An agent that’s been idle for two heartbeat cycles without receiving new work is either waiting for something specific (in which case it should say so) or it should be stopped.

The discipline is to treat idle agents as debt: acceptable for short periods, expensive if allowed to accumulate.

Batching vs. Parallelizing

More parallelism isn’t always better. Anthropic’s research system found that early versions would “spawn 50 subagents for simple queries” — a common over-parallelization failure where the orchestrator’s instinct to maximize throughput overshoots what the task actually requires.1

The right question is: what is the actual bottleneck? If the bottleneck is context window size (each writer can only handle so much research), parallelism helps. If the bottleneck is waiting for external dependencies (a web search that returns slowly), parallelism helps. If the bottleneck is orchestrator bandwidth (the lead agent can only review and route so much work at once), more workers make things worse.

The Research Writer Cluster Pattern

One pattern we’ve validated: instead of spawning a fresh writer agent per post, maintain a persistent writer per topic domain. A single “AI infrastructure” research writer handles all posts in that domain. It accumulates source knowledge, understands the domain’s key thinkers and papers, and builds up intuitions about what angles have been covered.

We observe meaningful cost reduction: less cold-start overhead, less context reconstruction, fewer tokens spent on “re-learning” the domain. The quality effect is similar — a writer that has covered ten posts in a domain produces better work than a fresh agent on its first task — though we haven’t formally measured either improvement.


7. What We Do in Practice

The klyve content pipeline runs on the axon fleet. Here’s the actual structure as of this writing:

Every post follows the same pipeline: brief → research writer → draft → editor review → publish. Nothing goes directly to publication without an editor gate.

What Has Worked

Persistent writers per topic cluster — as described above, the context accumulation effect is real: writers are noticeably more effective by their fifth post in a domain. We observe this consistently but haven’t formally measured the improvement.

Mandatory editor review — this is both quality control and failure detection. We’ve caught silent degradation (thin drafts, fabricated citations) through this gate that would have been invisible otherwise.

Structured memory files — each agent maintains a state.md with its current task, open threads, and anything it needs to remember next session. This simple pattern eliminates most cold-start disorientation.

The idle worker protocol — workers don’t silently expire; they explicitly check in and either receive new work or receive a stop signal. This has eliminated ghost agents from the fleet.

What Has Been Hard

Silent failures remain the hardest problem. An agent that produces plausible-looking but hollow output is hard to catch without human review. We haven’t found a fully automated solution; the editor gate is still manual review.

Agents blocking on unclear instructions — when a brief is ambiguous, writers will sometimes attempt it anyway rather than escalating. We’ve improved this through explicit prompting (“if you are blocked for more than one session on the same issue, escalate to your owner with a specific description”), but it requires ongoing prompt discipline.

Over-spawning in early versions — before we implemented the slot budget discipline, leads would spawn workers liberally and forget to stop them. The fleet accumulated idle agents that consumed slots without contributing work. The fix was treating slot budget as a first-class constraint, not an afterthought.

Three Specific Lessons

  1. Observability first, not last. We built structured heartbeats and task state tracking before we had complex agents. The investment paid off immediately — debugging a misbehaving agent with good state tracking takes minutes; without it, it takes hours.

  2. The editor gate caught things we didn’t expect. Originally the editor review was purely for quality. It turned out to be our most reliable signal for agent health. If the editor is rejecting or heavily revising more than one in five drafts, something is wrong upstream.

  3. Blocker escalation has to be reinforced. Agents default to trying harder, not to asking for help. The prompt instruction to escalate after two blocked sessions needs to be explicit, specific, and repeated in the session protocol.


8. Conclusion and Recommendations

Multi-agent fleet management is an operations discipline. The underlying problems — lifecycle management, work routing, health monitoring, failure recovery, cost control — are not fundamentally new. They’re the same problems that distributed systems engineers have been solving for decades. The difference is that the workers are LLM agents, not microservices, which introduces new failure modes (silent degradation, blocker spinning) and new constraints (context windows, token cost) but doesn’t change the basic operational calculus.

If you’re building a multi-agent system and thinking primarily about prompts, model selection, and tool design, you’re missing half the engineering problem. The ops layer is where production systems break.

If you do only one thing: implement structured heartbeats before you need them. Every other observability investment — task state tracking, cross-agent tracing, failure classification — builds on the foundation that heartbeats provide. Without them, you are operating blind, and you will only discover that when something breaks at the worst possible time.

You don’t need a complex platform to start. File-based state, structured messaging, and a consistent lifecycle protocol get you most of the way there. We built klyve’s content pipeline on exactly these primitives.

When you need more infrastructure — distributed tracing, automated quality validation, sophisticated routing algorithms — you’ll know, because you’ll have the baseline observability to see where your simple approach is breaking down. Graduate to more complexity when the simplicity is provably insufficient, not before.

The agents will surprise you. The operations layer is where you decide how quickly you recover from those surprises.


Related reading: Multi-Agent Coordination Tax | Multi-Agent Handoff Patterns


References

Footnotes

  1. Anthropic Engineering. “How We Built Our Multi-Agent Research System.” https://www.anthropic.com/engineering/multi-agent-research-system (2025). Describes production challenges including over-spawning, prompt engineering as the primary reliability lever, and rainbow deployments for safe rollouts.

  2. Dany Moshkovich, Hadar Mulian, Sergey Zeltyn et al. “Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems.” arXiv:2503.06745 (2025). ~79% of respondents identified non-deterministic execution flow as a major challenge in production agentic systems.

  3. Google Cloud. “Applying SRE Principles to Your MLOps Pipelines.” https://cloud.google.com/blog/products/devops-sre/applying-sre-principles-to-your-mlops-pipelines. Documents SRE principles applied to ML systems, including monitoring accuracy metrics alongside infrastructure metrics and the importance of holistic system observability.
