LLM Model Routing for AI Agent Fleets: When to Use Haiku, Sonnet, and Opus

Category: Agent Building | 2,800 words


Most agent tutorials start with a single model. One system prompt, one API call, one model handling everything from “reformat this JSON” to “analyze whether this contract clause creates indemnification risk.” That works fine in demos. In production, it’s either expensive or broken—often both.

The gap between a prototype and a profitable AI system is frequently a routing decision that nobody made explicitly. Either the team defaulted to a capable-but-expensive model because they were afraid of quality issues, or they defaulted to a cheap model and started noticing strange failures they couldn’t trace back to model capability limits. Model routing—the deliberate assignment of tasks to models based on task characteristics—is how production fleets bridge that gap.

This post is about the practical engineering of that decision: what routing is, when it pays off, how to build a routing table, what the math looks like, and what fails when you get it wrong.


The Routing Problem

An agent fleet runs many types of tasks. A content pipeline might include: extracting structured data from raw HTML, classifying whether extracted content is relevant, generating a draft from the relevant content, reviewing that draft for quality, and publishing. Those five steps have wildly different computational demands.

HTML extraction is a pattern-matching task. It requires reading input carefully and producing structured output. A capable but small model handles it fine. Draft review requires understanding nuance, catching subtle inaccuracies, and exercising editorial judgment. A small model will miss things that matter.

The naive solution—“use Haiku for simple tasks, Opus for complex ones”—sounds right until you try to implement it. The problem is that “simple” and “complex” are not knowable a priori for every task type. You don’t know whether a given input to your classifier has edge cases until after classification fails. And when it fails, you often don’t know it failed—the downstream agent just receives a wrong answer and acts on it.

This creates two distinct failure modes, one in each direction:

Under-routing (too cheap): A classification task is routed to a small model. The small model misclassifies a borderline case. A downstream agent acts on the wrong classification. Three steps later, a task fails in a way that’s expensive to diagnose and costs hours of rework. The $0.002 saved on the classification call resulted in $50 of engineering time and a failed deliverable.

Over-routing (too expensive): A status-check task—“does this text confirm or deny receipt?”—is sent to Opus. The answer is binary and the text is unambiguous. Opus produces the same result as Haiku would have. You paid nearly 19x more per call for identical output. At scale, this is not a minor inefficiency; it’s a structural cost problem.

Both failure modes are common. Neither is obvious until you’re deep enough in production to have data on task performance by model tier.


Task Taxonomy for Routing Decisions

Routing decisions should be grounded in task characteristics, not guesswork. The following taxonomy covers the majority of tasks in production agent fleets.

Tier 1: Token-Heavy, Low-Reasoning (Haiku)

These tasks involve reading substantial input and producing structured output, but do not require inference beyond pattern recognition. The model needs to be accurate and fast, not creative or deeply analytical.

Examples:

- Extracting structured fields (names, dates, amounts) from raw HTML or API responses into JSON
- Converting between formats: CSV to JSON, markdown to plain text
- Binary classification with unambiguous criteria: confirm/deny, present/absent
- Summarizing well-structured input: a news article into bullets, an invoice into a one-line summary

A competent small model handles these reliably. Errors happen at the margins—unusual formats, malformed inputs—but the base rate of success is high enough that routing to Haiku is appropriate, with a fallback escalation path for failures.

Tier 2: Moderate Reasoning, Context-Sensitive (Sonnet)

These tasks require integrating context, handling ambiguity, or making judgment calls where the correct answer depends on factors that vary across inputs. Pure pattern matching isn’t enough; the model needs to reason about what the pattern means.

Examples:

- Classification where categories have edge cases or overlap: sentiment with sarcasm, topic assignment with ambiguous boundaries
- Multi-step tool use: research a topic, draft a summary, cite sources
- Code generation from a clear specification, such as a function from a docstring
- Orchestration decisions: which subtask to run next, whether a result is complete enough to pass forward

Sonnet-tier tasks are the majority of “real work” in most fleets. They’re too complex for reliable Haiku execution but don’t require the heaviest reasoning capabilities. Most agent orchestration logic falls here.

Tier 3: High-Reasoning, High-Stakes (Sonnet or Opus)

These tasks require sustained multi-step reasoning or adversarial checking, or they produce outputs where an error costs significantly more than the model call. The question to ask: if the model gets this wrong, what’s the cost?

Examples:

- Complex planning: decomposing a task with many interdependencies into an executable graph
- Adversarial quality review: fact-checking and editorial passes that must catch subtle errors
- Legal, risk, or compliance analysis, such as contract clause review
- High-stakes final decisions: publish gates, escalation decisions

For Tier 3, the routing decision is a risk calculation: the model cost is small relative to the cost of failure. Use Opus when a wrong answer is materially expensive. Use Sonnet when quality is important but the failure cost is bounded.
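That risk calculation can be written down as an expected-cost comparison. A minimal sketch, where the function name and the numbers in the comment are illustrative assumptions, not figures from a real fleet:

```python
def should_use_opus(p_fail_cheaper: float, cost_of_failure: float,
                    opus_call_cost: float, cheaper_call_cost: float) -> bool:
    """Escalate when the expected cost of a cheaper model failing
    exceeds the price difference of simply calling Opus."""
    expected_failure_cost = p_fail_cheaper * cost_of_failure
    return expected_failure_cost > (opus_call_cost - cheaper_call_cost)

# Illustrative: a 2% failure rate on a $500 mistake vs. a ~$0.06 price gap.
# 0.02 * 500 = $10 expected loss, far above $0.06 -> route to Opus.
print(should_use_opus(0.02, 500, 0.07, 0.01))
```

The asymmetry is the point: for genuinely high-stakes tasks, even a small failure probability on a cheap model dominates the entire price spread between tiers.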

Routing Table

| Task Characteristic | Recommended Tier | Example |
| --- | --- | --- |
| Extraction / reformatting | Haiku | JSON field extraction, format conversion |
| Unambiguous binary classification | Haiku | Confirm/deny, present/absent |
| Summarization (structured input) | Haiku | News → bullets, invoice → summary |
| Classification with edge cases | Sonnet | Category assignment, sentiment with nuance |
| Multi-step tool use | Sonnet | Research → draft → cite |
| Code generation (specified task) | Sonnet | Function from docstring |
| Complex planning | Sonnet/Opus | Multi-dependency task decomposition |
| Adversarial / quality review | Sonnet/Opus | Fact-check, editorial pass |
| Legal / risk / compliance | Opus | Contract clause analysis |
| High-stakes final decision | Opus | Publish gate, escalation decision |
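As a starting point, this table can be implemented as a plain rules-based lookup. The task-type keys and the default tier below are illustrative assumptions, not a fixed schema:

```python
# Minimal rules-based routing table. Task-type keys mirror the table above;
# the names themselves are placeholders for whatever taxonomy your fleet uses.
ROUTING_TABLE = {
    "extract": "haiku",
    "binary_classify": "haiku",
    "summarize_structured": "haiku",
    "classify_edge_cases": "sonnet",
    "multi_step_tool_use": "sonnet",
    "codegen_specified": "sonnet",
    "complex_planning": "sonnet",   # escalate to Opus when dependencies run deep
    "quality_review": "sonnet",     # escalate to Opus for adversarial review
    "legal_risk": "opus",
    "final_decision": "opus",
}

def route(task_type: str, default: str = "sonnet") -> str:
    """Return the model tier for a task type.

    Unknown task types fall back to the middle tier: safer than Haiku,
    cheaper than defaulting everything to Opus.
    """
    return ROUTING_TABLE.get(task_type, default)
```

The choice of Sonnet as the fallback is deliberate: an unclassified task is, by definition, one whose complexity you haven’t measured yet.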

Production Cost Math

As of early 2026, Claude pricing is approximately:

- Haiku: $0.80 per million input tokens, $4.00 per million output tokens
- Sonnet: $3.00 per million input tokens, $15.00 per million output tokens
- Opus: $15.00 per million input tokens, $75.00 per million output tokens

Adjacent tiers differ by roughly 4–5x on both input and output. The spread is large enough that routing decisions have real financial consequences at any meaningful scale.

Take a concrete fleet: 500 tasks/day, averaging 2,000 input tokens and 500 output tokens per task.

Daily token volume:

- Input: 500 tasks × 2,000 tokens = 1,000,000 tokens (1M)
- Output: 500 tasks × 500 tokens = 250,000 tokens (0.25M)

All-Haiku: (1M × $0.80) + (0.25M × $4.00) = $0.80 + $1.00 = $1.80/day | Monthly: ~$54

All-Sonnet: (1M × $3.00) + (0.25M × $15.00) = $3.00 + $3.75 = $6.75/day | Monthly: ~$202

All-Opus: (1M × $15.00) + (0.25M × $75.00) = $15.00 + $18.75 = $33.75/day | Monthly: ~$1,013

Routed (80% Haiku / 15% Sonnet / 5% Opus):

400 Haiku tasks: 800K input + 200K output = $0.64 + $0.80 = $1.44

75 Sonnet tasks: 150K input + 37.5K output = $0.45 + $0.56 = $1.01

25 Opus tasks: 50K input + 12.5K output = $0.75 + $0.94 = $1.69

Total routed: $4.14/day | Monthly: ~$124

| Strategy | Daily Cost | Monthly Cost | vs. All-Opus |
| --- | --- | --- | --- |
| All-Haiku | $1.80 | $54 | -94.7% (but quality breaks) |
| All-Sonnet | $6.75 | $202 | -80.0% |
| Routed 80/15/5 | $4.14 | $124 | -87.7% |
| All-Opus | $33.75 | $1,013 | baseline |

The routed fleet achieves 87.7% cost reduction versus all-Opus while maintaining quality where it matters. Compared to all-Sonnet—often the “sensible default” for teams who want quality without going to Opus—routing still saves 38.6% while buying Opus-tier quality on the 5% of tasks that need it.

That 38.6% against Sonnet translates to ~$78/month at this scale, or about $936/year for a single fleet. For teams running multiple fleets or higher volumes, the numbers scale linearly with task count.
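A short script can reproduce this cost math, assuming per-million-token prices of $0.80/$4 (Haiku), $3/$15 (Sonnet), and $15/$75 (Opus), which are the rates the totals in this section imply:

```python
# Per-million-token prices as (input, output). Assumed rates consistent
# with the daily totals quoted in this section.
PRICES = {"haiku": (0.80, 4.00), "sonnet": (3.00, 15.00), "opus": (15.00, 75.00)}

IN_TOK, OUT_TOK = 2_000, 500  # average tokens per task
TASKS_PER_DAY = 500

def daily_cost(mix: dict) -> float:
    """Daily cost in dollars for a {tier: task_count} mix."""
    total = 0.0
    for tier, n in mix.items():
        p_in, p_out = PRICES[tier]
        total += n * (IN_TOK * p_in + OUT_TOK * p_out) / 1_000_000
    return round(total, 2)

print(daily_cost({"haiku": TASKS_PER_DAY}))                    # all-Haiku
print(daily_cost({"sonnet": TASKS_PER_DAY}))                   # all-Sonnet
print(daily_cost({"opus": TASKS_PER_DAY}))                     # all-Opus
print(daily_cost({"haiku": 400, "sonnet": 75, "opus": 25}))    # routed 80/15/5
```

Plugging in your own fleet’s task counts and token averages is the fastest way to see whether routing is worth building at your scale.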

The important caveat: these savings are only realized if the routing is correct. An incorrect routing that sends 10% of tasks to the wrong tier in the wrong direction can erase the savings or introduce failures that cost more than the model savings. More on this in the failure modes section.


The Plan-and-Execute Pattern

The most effective routing architecture in agent fleets is not a static lookup table—it’s a plan-and-execute loop where a capable model makes the routing decisions dynamically.

Structure:

  1. Planner (Sonnet or Opus): Receives the high-level task. Breaks it into subtasks. For each subtask, assigns a model tier and justification. Returns a task graph with explicit routing decisions.

  2. Executors (Haiku or Sonnet): Run individual subtasks with the model the planner specified. They don’t make routing decisions—they execute.

  3. Reviewer (Sonnet or Opus): Checks executor outputs before returning results or passing to the next stage. Decides whether to accept the output, retry with a higher-tier model, or escalate.

This pattern has a key property: expensive reasoning happens once per task at the planning step, not once per subtask. If a task decomposes into eight subtasks and the planner runs on Sonnet, you pay Sonnet cost once. The eight Haiku executors run at Haiku cost. The reviewer runs on Sonnet once. The blended cost is much closer to Haiku than to Sonnet, while the plan quality is Sonnet-tier.

In practice, the planner output looks like:

Task: Research and summarize competitive landscape for product X
Subtasks:
  1. Extract company names from provided URL list → Haiku
  2. For each company, extract product features from their docs → Haiku
  3. Classify each company by primary market segment → Sonnet (edge cases expected)
  4. Identify cross-cutting themes across segments → Sonnet
  5. Draft executive summary → Sonnet
  6. Review summary for accuracy against source material → Sonnet

The planner’s routing annotations are not arbitrary—they’re based on task characteristics the planner can observe: input structure, expected output complexity, downstream sensitivity. A good planner prompt makes the routing taxonomy explicit, giving the model a framework to reason against.

The reviewer’s role is equally important. It acts as a quality gate that can catch under-routing failures before they propagate. If a Haiku executor produces an output the reviewer flags as uncertain, the reviewer can trigger a retry with Sonnet rather than passing a wrong answer forward.
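The three roles above can be sketched as a single loop. This is a structural sketch, not a production implementation: `call_model` stands in for a real API client, and the plan and verdict shapes are assumed conventions.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    description: str
    tier: str        # "haiku" | "sonnet" | "opus", assigned by the planner
    reason: str      # the planner's routing justification

def run_fleet(task: str, call_model) -> list:
    """Plan-and-execute loop: plan once, execute cheap, review before passing on."""
    # 1. Planner (Sonnet-tier): decompose the task, assign a tier per subtask.
    plan: list = call_model("sonnet", f"Plan and route: {task}")
    results = []
    for sub in plan:
        # 2. Executor: run at exactly the tier the planner specified.
        out = call_model(sub.tier, sub.description)
        # 3. Reviewer (Sonnet-tier): gate the output; on rejection, retry
        #    one tier up rather than passing a wrong answer forward.
        verdict = call_model("sonnet", f"Review: {out}")
        if verdict == "reject" and sub.tier == "haiku":
            out = call_model("sonnet", sub.description)  # graduated retry
        results.append(out)
    return results
```

Note that the expensive calls (plan, review) happen a fixed number of times per task, while the per-subtask work runs at whatever tier the plan assigned.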

This pattern was documented academically in research on cascaded LLM systems: FrugalGPT (Chen et al., arXiv:2305.05176) demonstrated that sequential model querying—trying cheaper models first and escalating on low confidence—achieves up to 98% cost reduction while matching the performance of the most capable model. The plan-and-execute pattern is a structural application of the same principle at the agent fleet level rather than the individual query level.

More recent work on multi-agent routing—specifically MasRouter (ACL 2025), which addresses routing in multi-agent systems with explicit cost-performance optimization—confirms that routing decisions made at the system level (by a routing agent or planner) outperform per-query static rules, particularly on tasks with variable complexity. The key insight: a capable model routing cheap models outperforms a set of cheap models routing themselves.


Routing Failure Modes

Routing is not a set-and-forget decision. These are the failure modes that appear in production fleets that have implemented routing without ongoing maintenance.

Failure Mode 1: Routing Table Staleness

Model capabilities change. Haiku in December is not the same model as Haiku in March—providers update models continuously, sometimes with documented capability changes, sometimes without. A routing table built on observed performance from six months ago may now be incorrect.

If Haiku’s classification reliability on your edge-case categories has improved, you’re leaving money on the table by routing those to Sonnet. If Sonnet’s code generation has regressed on a specific pattern your fleet relies on, tasks you thought were safe at Sonnet tier are now quietly failing.

Mitigation: Treat routing decisions as hypotheses. Run a baseline evaluation on model tiers quarterly. If your routing table was built on empirical task success rates, re-run those benchmarks after model updates. Do not assume stability.

Failure Mode 2: Silent Task Complexity Underestimation

The planner classifies a task as Haiku-tier. The task has edge cases the planner didn’t anticipate—an unusual input format, a borderline classification, a domain term the small model was not trained on. The Haiku executor produces a confident-looking wrong answer. There’s no error. No exception. No low-confidence flag. The output propagates downstream.

This is the hardest failure mode to catch because it’s invisible until something breaks further down the pipeline. By then, the root cause—a mis-tiered task three steps back—has been overwritten and is hard to reconstruct.

Mitigation: Add structured confidence signals to executor outputs. Require executors to produce a brief self-assessment alongside their output (“confidence: high/medium/low, reason: …”). Route medium/low confidence outputs to the reviewer for secondary evaluation before propagation. This adds latency but catches a class of silent failures that otherwise become expensive debugging sessions.
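One way to sketch that mitigation, assuming a convention (not a standard API feature) where executors append a trailing `confidence:` line to their output:

```python
def parse_confidence(output: str) -> str:
    """Extract 'high' | 'medium' | 'low' from a trailing 'confidence: X, ...' line."""
    for line in reversed(output.strip().splitlines()):
        if line.lower().startswith("confidence:"):
            return line.split(":", 1)[1].strip().split(",")[0].strip().lower()
    # A missing self-assessment is itself a signal: treat it as low confidence.
    return "low"

def needs_review(output: str) -> bool:
    """Route anything short of high confidence to the reviewer before it propagates."""
    return parse_confidence(output) != "high"
```

Treating a missing self-assessment as low confidence is the conservative choice: the executor that forgets to assess itself is the one you least want to trust silently.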

Failure Mode 3: Cascade Over-Routing from Upstream Failures

A Haiku executor produces a wrong answer. A downstream agent receives it, encounters an inconsistency, and escalates. The escalation logic is set up to send escalated tasks to Opus to ensure resolution. Opus processes the escalation—using a full Opus call—when the actual problem was an upstream classification error that a simple retry with Sonnet would have caught.

You saved $0.002 at step 1. You spent $0.15 at step 5. The economics reversed, and the retry chain consumed more total time than if you’d used Sonnet at step 1.

Cascaded escalation paths are a common design in fault-tolerant agent systems, and they’re correct in principle. The failure is in the escalation trigger: escalating to the most expensive tier when the problem may only require one tier up.

Mitigation: Design escalation paths with graduated model tiers. An escalation from Haiku should retry with Sonnet, not immediately with Opus. Opus is the last resort, not the default escalation target. Log escalation frequency by task type—a high escalation rate on a given task type is a signal that the tier assignment is wrong, not that escalation is working as designed.
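The graduated ladder can be sketched as follows; `attempt` stands in for a real model call plus its quality check, and the `(output, ok)` return shape is an assumption for illustration:

```python
TIER_LADDER = ["haiku", "sonnet", "opus"]

def run_with_escalation(task: str, start_tier: str, attempt):
    """Try the task at start_tier, escalating one tier at a time.

    Opus is reached only after the intermediate tier has failed --
    it is the last resort, never the default escalation target.
    """
    start = TIER_LADDER.index(start_tier)
    output = None
    for tier in TIER_LADDER[start:]:
        output, ok = attempt(tier, task)
        if ok:
            return tier, output
    # Every tier failed: return the top-tier result, still flagged for a human.
    return "opus", output
```

Logging which tier each task ultimately resolved at gives you exactly the escalation-frequency-by-task-type signal described above.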

Failure Mode 4: Planner Overconfidence

The planner is running on Sonnet. The planner is also subject to the same complexity underestimation problem as the executors—it doesn’t always know what it doesn’t know. A planner that consistently under-routes tasks (assigning Haiku to tasks that need Sonnet) may do so because those tasks look simple in the abstract but are complex in execution.

Mitigation: Track downstream failure rates by planner-assigned tier. If Haiku-assigned tasks fail at significantly higher rates than Sonnet-assigned tasks, the planner’s routing judgment is systematically miscalibrated. Feed this data back as context to the planner, or add explicit routing heuristics to the planner’s system prompt based on observed failure patterns.
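A minimal sketch of that tracking, assuming log records shaped as `(assigned_tier, failed)` pairs (the record shape is an assumption; adapt it to whatever your fleet actually logs):

```python
from collections import defaultdict

def failure_rates(records) -> dict:
    """Downstream failure rate per planner-assigned tier.

    records: iterable of (assigned_tier, failed: bool) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [failures, total]
    for tier, failed in records:
        counts[tier][0] += int(failed)
        counts[tier][1] += 1
    return {tier: f / n for tier, (f, n) in counts.items()}
```

A Haiku-assigned failure rate well above the Sonnet-assigned rate is the calibration signal to feed back into the planner prompt.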


When Not to Route

Routing adds complexity. There are contexts where that complexity is not worth the cost savings it unlocks.

Single-task, low-volume workflows: If your agent runs 20 tasks per day, the cost difference between tiers is measured in cents per day. The engineering time to build, test, and maintain a routing system costs orders of magnitude more than the savings over any reasonable payback period. Use a single, sufficiently capable model. Optimize when you have data and scale.

Uniformly high-stakes tasks: If every task in your pipeline could cause significant downstream harm if wrong, and the cost difference between Sonnet and Opus is small relative to the risk, don’t route. Accept the higher cost as the cost of reliability. Some contexts—legal review, security analysis, compliance checks—have failure costs that dwarf any plausible model cost savings.

Tasks with no reliable quality signal: Routing systems that include escalation paths need to know when escalation is warranted. If you can’t define a reliable quality check for executor output, you can’t build a reviewer that escalates appropriately. A routing system without a quality signal is a routing system that can’t catch its own failures. Better to use a single tier you trust.

Very early stage systems: Don’t optimize routing before you have data on where failures occur. Build the system. Run it. Observe where model limitations actually cause problems. Route based on evidence, not speculation. Premature routing architecture is expensive to maintain and may not address the actual failure modes.


The Verdict

Model routing pays off when three conditions are met: your fleet runs more than a few hundred tasks per day, your tasks are heterogeneous (mixing high-reasoning and low-reasoning work), and you have observability into task success rates by type.

At 500 tasks/day with an 80/15/5 split, routing saves ~$78/month versus all-Sonnet and ~$890/month versus all-Opus. At 5,000 tasks/day, multiply those figures by ten. The math is not subtle.

The minimum viable routing implementation is not a complex ML classifier. It’s a planner that explicitly assigns tiers to subtasks, a reviewer that flags low-confidence outputs for escalation, and a graduated escalation path (Haiku → Sonnet → Opus, not Haiku → Opus). That architecture, implemented well, captures the majority of the available cost reduction while adding a manageable amount of complexity.

The overhead cost of over-engineering routing is real. Building a learned routing model, training it on historical task data, deploying it as a separate service, and maintaining it is a significant engineering investment. Before that investment pays off, you need scale that most teams don’t have. Start with a rules-based planner. Upgrade when the rules break.

The research backs this up. FrugalGPT (arXiv:2305.05176) showed that even a simple cascade—try cheap first, escalate on low confidence—achieves near-maximum cost efficiency. MasRouter (ACL 2025) extended this to multi-agent settings, confirming that explicit routing by a capable coordinator outperforms implicit routing embedded in each agent’s own decision-making. The architecture conclusion is the same: put routing intelligence at the coordination layer, run cheap models at the execution layer, and verify at the review layer.

In a fleet running eight active agents, routing between model tiers is the difference between running out of budget and having capacity to take on more work. A fleet that uses Opus for everything burns through inference budget fast. A fleet that routes deliberately uses the same budget to run more tasks, more reliably, at lower per-task cost.

Route deliberately. Don’t hedge. The Opus calls you reserve for high-stakes work deliver more value when they’re not diluted by tasks a Haiku could handle in its sleep.


References

  1. Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176. Demonstrated that LLM cascade strategies—routing to cheaper models and escalating on low confidence—can reduce inference cost by up to 98% while matching the performance of the most capable model.

  2. MasRouter: Learning to Route LLMs for Multi-Agent Systems. ACL 2025. Addresses routing in multi-agent systems with explicit cost-performance tradeoffs, demonstrating that system-level routing by a capable coordinator outperforms distributed per-agent routing decisions.

  3. Ong, I., et al. (2024). A Unified Approach to Routing and Cascading for LLMs. arXiv:2410.10347. Formalizes the relationship between routing (selecting one model) and cascading (sequential escalation), providing a theoretical framework for optimal model selection strategies under cost constraints.
