Category: Agent Building | 2,800 words
Most agent tutorials start with a single model. One system prompt, one API call, one model handling everything from “reformat this JSON” to “analyze whether this contract clause creates indemnification risk.” That works fine in demos. In production, it’s either expensive or broken—often both.
The gap between a prototype and a profitable AI system is frequently a routing decision that nobody made explicitly. Either the team defaulted to a capable-but-expensive model because they were afraid of quality issues, or they defaulted to a cheap model and started noticing strange failures they couldn’t trace back to model capability limits. Model routing—the deliberate assignment of tasks to models based on task characteristics—is how production fleets bridge that gap.
This post is about the practical engineering of that decision: what routing is, when it pays off, how to build a routing table, what the math looks like, and what fails when you get it wrong.
The Routing Problem
An agent fleet runs many types of tasks. A content pipeline might include: extracting structured data from raw HTML, classifying whether extracted content is relevant, generating a draft from the relevant content, reviewing that draft for quality, and publishing. Those five steps have wildly different computational demands.
HTML extraction is a pattern-matching task. It requires reading input carefully and producing structured output. A capable but small model handles it fine. Draft review requires understanding nuance, catching subtle inaccuracies, and exercising editorial judgment. A small model will miss things that matter.
The naive solution—“use Haiku for simple tasks, Opus for complex ones”—sounds right until you try to implement it. The problem is that “simple” and “complex” are not knowable a priori for every task type. You don’t know whether a given input to your classifier has edge cases until after classification fails. And when it fails, you often don’t know it failed—the downstream agent just receives a wrong answer and acts on it.
This creates two distinct failure modes:
Under-routing (too cheap): A classification task is routed to a small model. The small model misclassifies a borderline case. A downstream agent acts on the wrong classification. Three steps later, a task fails in a way that’s expensive to diagnose and costs hours of rework. The $0.002 saved on the classification call resulted in $50 of engineering time and a failed deliverable.
Over-routing (too expensive): A status-check task—“does this text confirm or deny receipt?”—is sent to Opus. The answer is binary and the text is unambiguous. Opus produces the same result as Haiku would have. You paid nearly 19x more per call for identical output. At scale, this is not a minor inefficiency; it’s a structural cost problem.
Both failure modes are common. Neither is obvious until you’re deep enough in production to have data on task performance by model tier.
Task Taxonomy for Routing Decisions
Routing decisions should be grounded in task characteristics, not guesswork. The following taxonomy covers the majority of tasks in production agent fleets.
Tier 1: Token-Heavy, Low-Reasoning (Haiku)
These tasks involve reading substantial input and producing structured output, but do not require inference beyond pattern recognition. The model needs to be accurate and fast, not creative or deeply analytical.
Examples:
- Extracting fields from documents (invoices, forms, HTML)
- Reformatting structured data (JSON → Markdown, CSV → JSON)
- Summarization of well-structured content with clear facts
- Basic string classification where categories are unambiguous
- Translation between formats with explicit mappings
- Deduplication and canonicalization
A competent small model handles these reliably. Errors happen at the margins—unusual formats, malformed inputs—but the base rate of success is high enough that routing to Haiku is appropriate, with a fallback escalation path for failures.
Tier 2: Moderate Reasoning, Context-Sensitive (Sonnet)
These tasks require integrating context, handling ambiguity, or making judgment calls where the correct answer depends on factors that vary across inputs. Pure pattern matching isn’t enough; the model needs to reason about what the pattern means.
Examples:
- Classification with edge cases or overlapping categories
- Multi-step tool calling with conditional logic
- First-pass drafting from a brief or outline
- Summarization of complex, contradictory, or nuanced content
- Code generation for well-specified tasks
- Entity extraction where context disambiguates entities
Sonnet-tier tasks are the majority of “real work” in most fleets. They’re too complex for reliable Haiku execution but don’t require the heaviest reasoning capabilities. Most agent orchestration logic falls here.
Tier 3: High-Reasoning, High-Stakes (Sonnet or Opus)
These tasks either require sustained multi-step reasoning or adversarial checking, or produce outputs where an error costs significantly more than the model call. The question to ask: if the model gets this wrong, what’s the cost?
Examples:
- Complex planning with many interdependent constraints
- Legal, compliance, or risk analysis where misses have downstream consequences
- Adversarial review (checking your own fleet’s outputs for errors)
- Final editorial review where quality determines publication
- Tasks where the model needs to maintain coherent state across many steps
- Security or safety classification where false negatives are expensive
For Tier 3, the routing decision is a risk calculation: the model cost is small relative to the cost of failure. Use Opus when a wrong answer is materially expensive. Use Sonnet when quality is important but the failure cost is bounded.
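The Tier 3 question—what does a wrong answer cost?—can be made explicit as an expected-cost comparison. A minimal sketch, with hypothetical per-call costs and failure rates (real numbers should come from your own eval data):

```python
def expected_cost(call_cost: float, failure_prob: float, failure_cost: float) -> float:
    """Expected total cost of routing a task to a given model tier."""
    return call_cost + failure_prob * failure_cost

def choose_tier(candidates: dict[str, tuple[float, float]], failure_cost: float) -> str:
    """Pick the tier minimizing call cost plus expected failure cost.

    candidates maps tier name -> (per-call cost, estimated failure probability).
    """
    return min(candidates,
               key=lambda t: expected_cost(candidates[t][0], candidates[t][1], failure_cost))

# Hypothetical numbers: a contract-analysis task where a miss costs $500.
candidates = {
    "sonnet": (0.01, 0.05),  # cheap call, 5% estimated miss rate
    "opus":   (0.05, 0.01),  # 5x the call cost, 1% estimated miss rate
}
print(choose_tier(candidates, failure_cost=500.0))  # prints "opus"
```

At a $500 failure cost, Opus wins despite the 5x call price; drop the failure cost to cents and the cheaper tier wins. The routing decision is just this comparison made explicit.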
Routing Table
| Task Characteristic | Recommended Tier | Example |
|---|---|---|
| Extraction / reformatting | Haiku | JSON field extraction, format conversion |
| Unambiguous binary classification | Haiku | Confirm/deny, present/absent |
| Summarization (structured input) | Haiku | News → bullets, invoice → summary |
| Classification with edge cases | Sonnet | Category assignment, sentiment with nuance |
| Multi-step tool use | Sonnet | Research → draft → cite |
| Code generation (specified task) | Sonnet | Function from docstring |
| Complex planning | Sonnet/Opus | Multi-dependency task decomposition |
| Adversarial / quality review | Sonnet/Opus | Fact-check, editorial pass |
| Legal / risk / compliance | Opus | Contract clause analysis |
| High-stakes final decision | Opus | Publish gate, escalation decision |
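In code, a table like this can start life as a plain lookup, long before any learned router exists. A sketch; the task-type keys and tier names are illustrative, not a fixed schema:

```python
# A rules-based routing table: task characteristic -> model tier.
# Tier names stand in for concrete model IDs configured elsewhere.
ROUTING_TABLE = {
    "extraction": "haiku",
    "binary_classification": "haiku",
    "structured_summarization": "haiku",
    "edge_case_classification": "sonnet",
    "multi_step_tool_use": "sonnet",
    "code_generation": "sonnet",
    "complex_planning": "opus",
    "adversarial_review": "opus",
    "legal_risk_compliance": "opus",
    "final_decision": "opus",
}

def route(task_type: str, default: str = "sonnet") -> str:
    """Return the tier for a task type, defaulting to the middle tier.

    Unknown task types fall back to Sonnet: safer than Haiku,
    cheaper than Opus, pending an explicit table entry.
    """
    return ROUTING_TABLE.get(task_type, default)
```

The deliberate choice here is the default: an unrecognized task type routes to the middle tier, not the cheapest, so new task types fail safe rather than cheap.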
Production Cost Math
As of early 2026, Claude pricing is approximately:
- Haiku: $0.80/M input tokens, $4/M output tokens
- Sonnet: $3/M input tokens, $15/M output tokens
- Opus: $15/M input tokens, $75/M output tokens
Adjacent tiers differ by roughly 4x to 5x on both input and output. The spread is large enough that routing decisions have real financial consequences at any meaningful scale.
Take a concrete fleet: 500 tasks/day, averaging 2,000 input tokens and 500 output tokens per task.
Daily token volume:
- Input: 500 × 2,000 = 1,000,000 tokens (1M)
- Output: 500 × 500 = 250,000 tokens (0.25M)
All-Haiku:
- Input cost: 1M × $0.80/M = $0.80
- Output cost: 0.25M × $4/M = $1.00
- Daily: $1.80 | Monthly: ~$54
All-Sonnet:
- Input cost: 1M × $3/M = $3.00
- Output cost: 0.25M × $15/M = $3.75
- Daily: $6.75 | Monthly: ~$202
All-Opus:
- Input cost: 1M × $15/M = $15.00
- Output cost: 0.25M × $75/M = $18.75
- Daily: $33.75 | Monthly: ~$1,013
Routed (80% Haiku / 15% Sonnet / 5% Opus):
400 Haiku tasks:
- Input: 0.8M × $0.80 = $0.64; Output: 0.2M × $4 = $0.80 → $1.44/day
75 Sonnet tasks:
- Input: 0.15M × $3 = $0.45; Output: 0.0375M × $15 = $0.56 → $1.01/day
25 Opus tasks:
- Input: 0.05M × $15 = $0.75; Output: 0.0125M × $75 = $0.94 → $1.69/day
Total routed: $4.14/day | Monthly: ~$124
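The arithmetic above generalizes to any mix. A small calculator using the pricing quoted earlier, so you can test your own split:

```python
# Pricing per million tokens (input, output), as quoted above.
PRICES = {"haiku": (0.80, 4.00), "sonnet": (3.00, 15.00), "opus": (15.00, 75.00)}

def daily_cost(tasks_per_day: int, mix: dict[str, float],
               in_tokens: int = 2_000, out_tokens: int = 500) -> float:
    """Blended daily cost for a task mix (tier -> fraction of tasks)."""
    total = 0.0
    for tier, share in mix.items():
        n = tasks_per_day * share
        in_price, out_price = PRICES[tier]
        total += n * in_tokens / 1e6 * in_price    # input cost
        total += n * out_tokens / 1e6 * out_price  # output cost
    return total

routed = daily_cost(500, {"haiku": 0.80, "sonnet": 0.15, "opus": 0.05})
print(f"${routed:.2f}/day")  # prints "$4.14/day", matching the worked math
```

Passing `{"opus": 1.0}` reproduces the $33.75 all-Opus figure, which makes the function a quick way to sanity-check a proposed mix before committing to it.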
| Strategy | Daily Cost | Monthly Cost | vs. All-Opus |
|---|---|---|---|
| All-Haiku | $1.80 | $54 | -94.7% (but quality breaks) |
| All-Sonnet | $6.75 | $202 | -80.0% |
| Routed 80/15/5 | $4.14 | $124 | -87.7% |
| All-Opus | $33.75 | $1,013 | baseline |
The routed fleet achieves 87.7% cost reduction versus all-Opus while maintaining quality where it matters. Compared to all-Sonnet—often the “sensible default” for teams who want quality without going to Opus—routing still saves 38.6% while buying Opus-tier quality on the 5% of tasks that need it.
That 38.6% against Sonnet translates to ~$78/month at this scale, or about $936/year for a single fleet. For teams running multiple fleets or higher volumes, the numbers scale linearly with task count.
The important caveat: these savings are only realized if the routing is correct. An incorrect routing that sends 10% of tasks to the wrong tier in the wrong direction can erase the savings or introduce failures that cost more than the model savings. More on this in the failure modes section.
The Plan-and-Execute Pattern
The most effective routing architecture in agent fleets is not a static lookup table—it’s a plan-and-execute loop where a capable model makes the routing decisions dynamically.
Structure:
1. Planner (Sonnet or Opus): Receives the high-level task. Breaks it into subtasks. For each subtask, assigns a model tier and justification. Returns a task graph with explicit routing decisions.
2. Executors (Haiku or Sonnet): Run individual subtasks with the model the planner specified. They don’t make routing decisions—they execute.
3. Reviewer (Sonnet or Opus): Checks executor outputs before returning results or passing to the next stage. Decides whether to accept the output, retry with a higher-tier model, or escalate.
This pattern has a key property: expensive reasoning happens once per task at the planning step, not once per subtask. If a task decomposes into eight subtasks and the planner runs on Sonnet, you pay Sonnet cost once. The eight Haiku executors run at Haiku cost. The reviewer runs on Sonnet once. The blended cost is much closer to Haiku than to Sonnet, while the plan quality is Sonnet-tier.
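The loop itself is short. Everything in this sketch is illustrative: `call_model` stands in for real API calls, and `toy_plan` stands in for the planner model returning a task graph:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    description: str
    tier: str          # planner-assigned: "haiku", "sonnet", or "opus"
    justification: str

def call_model(tier: str, prompt: str) -> str:
    """Stand-in for a real API call; returns a placeholder result."""
    return f"[{tier}] {prompt[:40]}"

def run_task(task: str, plan) -> list[str]:
    """Plan once (expensive), execute each subtask cheaply, review once."""
    subtasks: list[Subtask] = plan(task)          # one Sonnet/Opus planning call
    outputs = [call_model(s.tier, s.description)  # Haiku/Sonnet executors
               for s in subtasks]
    review = call_model("sonnet", f"Review outputs for: {task}")
    outputs.append(review)                        # one Sonnet review call
    return outputs

# Hypothetical planner output; in production this is itself a model call.
def toy_plan(task: str) -> list[Subtask]:
    return [Subtask("Extract company names from URLs", "haiku", "pure extraction"),
            Subtask("Classify by market segment", "sonnet", "edge cases expected")]

results = run_task("Competitive landscape for product X", toy_plan)
```

Note where the money goes: one planning call, one review call, and everything in between at executor rates.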
In practice, the planner output looks like:
Task: Research and summarize competitive landscape for product X
Subtasks:
1. Extract company names from provided URL list → Haiku
2. For each company, extract product features from their docs → Haiku
3. Classify each company by primary market segment → Sonnet (edge cases expected)
4. Identify cross-cutting themes across segments → Sonnet
5. Draft executive summary → Sonnet
6. Review summary for accuracy against source material → Sonnet
The planner’s routing annotations are not arbitrary—they’re based on task characteristics the planner can observe: input structure, expected output complexity, downstream sensitivity. A good planner prompt makes the routing taxonomy explicit, giving the model a framework to reason against.
The reviewer’s role is equally important. It acts as a quality gate that can catch under-routing failures before they propagate. If a Haiku executor produces an output the reviewer flags as uncertain, the reviewer can trigger a retry with Sonnet rather than passing a wrong answer forward.
This pattern was documented academically in research on cascaded LLM systems: FrugalGPT (Chen et al., arXiv:2305.05176) demonstrated that sequential model querying—trying cheaper models first and escalating on low confidence—achieves up to 98% cost reduction while matching the performance of the most capable model. The plan-and-execute pattern is a structural application of the same principle at the agent fleet level rather than the individual query level.
More recent work on multi-agent routing—specifically MasRouter (ACL 2025), which addresses routing in multi-agent systems with explicit cost-performance optimization—confirms that routing decisions made at the system level (by a routing agent or planner) outperform per-query static rules, particularly on tasks with variable complexity. The key insight: a capable model routing cheap models outperforms a set of cheap models routing themselves.
Routing Failure Modes
Routing is not a set-and-forget decision. These are the failure modes that appear in production fleets that have implemented routing without ongoing maintenance.
Failure Mode 1: Routing Table Staleness
Model capabilities change. Haiku in December is not the same model as Haiku in March—providers update models continuously, sometimes with documented capability changes, sometimes without. A routing table built on observed performance from six months ago may now be incorrect.
If Haiku’s classification reliability on your edge-case categories has improved, you’re leaving money on the table by routing those to Sonnet. If Sonnet’s code generation has regressed on a specific pattern your fleet relies on, tasks you thought were safe at Sonnet tier are now quietly failing.
Mitigation: Treat routing decisions as hypotheses. Run a baseline evaluation on model tiers quarterly. If your routing table was built on empirical task success rates, re-run those benchmarks after model updates. Do not assume stability.
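One way to operationalize the quarterly check is a drift detector over per-task-type success rates. A sketch, with hypothetical benchmark numbers and an arbitrary five-point threshold:

```python
def detect_drift(baseline: dict[str, float], current: dict[str, float],
                 threshold: float = 0.05) -> dict[str, str]:
    """Compare per-task-type success rates against a stored baseline.

    Returns task types whose rate moved more than `threshold`,
    labeled as candidates for re-routing up or down a tier.
    """
    flags = {}
    for task_type, base_rate in baseline.items():
        delta = current.get(task_type, base_rate) - base_rate
        if delta >= threshold:
            flags[task_type] = "improved: consider routing down a tier"
        elif delta <= -threshold:
            flags[task_type] = "regressed: consider routing up a tier"
    return flags

# Hypothetical quarterly re-run: Haiku improved on edge-case classification,
# Sonnet regressed on a code-generation pattern the fleet relies on.
flags = detect_drift(
    baseline={"edge_case_classification": 0.88, "code_generation": 0.95},
    current={"edge_case_classification": 0.96, "code_generation": 0.85},
)
```

The output is a review queue, not an automatic re-route: a flagged task type is a hypothesis to re-test, exactly as the mitigation above prescribes.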
Failure Mode 2: Silent Task Complexity Underestimation
The planner classifies a task as Haiku-tier. The task has edge cases the planner didn’t anticipate—an unusual input format, a borderline classification, a domain term the small model was not trained on. The Haiku executor produces a confident-looking wrong answer. There’s no error. No exception. No low-confidence flag. The output propagates downstream.
This is the hardest failure mode to catch because it’s invisible until something breaks further down the pipeline. By then, the root cause—a mis-tiered task three steps back—has been overwritten and is hard to reconstruct.
Mitigation: Add structured confidence signals to executor outputs. Require executors to produce a brief self-assessment alongside their output (“confidence: high/medium/low, reason: …”). Route medium/low confidence outputs to the reviewer for secondary evaluation before propagation. This adds latency but catches a class of silent failures that otherwise become expensive debugging sessions.
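A minimal version of that confidence gate, assuming executors are prompted to return JSON with a self-assessment (the field names here are illustrative, not a standard schema):

```python
import json

def execute_with_confidence(tier: str, prompt: str, call_model) -> dict:
    """Run an executor that must return JSON including a self-assessment."""
    raw = call_model(tier, prompt + '\nReturn JSON: {"answer": ..., '
                           '"confidence": "high|medium|low", "reason": ...}')
    return json.loads(raw)

def gate(output: dict) -> str:
    """Route by self-reported confidence: only 'high' propagates directly."""
    return "propagate" if output.get("confidence") == "high" else "review"

# Stub standing in for a real API call; reports a borderline case.
stub = lambda tier, prompt: (
    '{"answer": "category-b", "confidence": "medium", "reason": "borderline"}')
out = execute_with_confidence("haiku", "Classify this record", stub)
```

Self-reported confidence is a weak signal on its own, but it is cheap to collect, and routing only the medium/low tail to the reviewer is what keeps the added latency bounded.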
Failure Mode 3: Cascade Over-Routing from Upstream Failures
A Haiku executor produces a wrong answer. A downstream agent receives it, encounters an inconsistency, and escalates. The escalation logic is set up to send escalated tasks to Opus to ensure resolution. Opus processes the escalation—using a full Opus call—when the actual problem was an upstream classification error that a simple retry with Sonnet would have caught.
You saved $0.002 at step 1. You spent $0.15 at step 5. The economics reversed, and the retry chain consumed more total time than if you’d used Sonnet at step 1.
Cascaded escalation paths are a common design in fault-tolerant agent systems, and they’re correct in principle. The failure is in the escalation trigger: escalating to the most expensive tier when the problem may only require one tier up.
Mitigation: Design escalation paths with graduated model tiers. An escalation from Haiku should retry with Sonnet, not immediately with Opus. Opus is the last resort, not the default escalation target. Log escalation frequency by task type—a high escalation rate on a given task type is a signal that the tier assignment is wrong, not that escalation is working as designed.
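A graduated escalation path is a short loop over the tier ladder. A sketch with a stubbed model call and a caller-supplied acceptance check (in production, the reviewer plays that role):

```python
TIERS = ["haiku", "sonnet", "opus"]  # cheapest to most capable

def run_with_escalation(prompt: str, call_model, accept, start: str = "haiku"):
    """Retry up the tier ladder one step at a time; Opus is the last resort."""
    for tier in TIERS[TIERS.index(start):]:
        result = call_model(tier, prompt)
        if accept(result):
            return tier, result
    raise RuntimeError("all tiers exhausted; escalate to a human")

# Stub: pretend Haiku fails this task and Sonnet succeeds.
stub = lambda tier, prompt: "ok" if tier != "haiku" else "garbled"
tier, result = run_with_escalation("classify this record", stub,
                                   accept=lambda r: r == "ok")
```

Here the escalation resolves at Sonnet and Opus is never called, which is the whole point: the ladder structure makes one-tier-up the default and the top tier the exception.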
Failure Mode 4: Planner Overconfidence
The planner is running on Sonnet. The planner is also subject to the same complexity underestimation problem as the executors—it doesn’t always know what it doesn’t know. A planner that consistently under-routes tasks (assigning Haiku to tasks that need Sonnet) may do so because those tasks look simple in the abstract but are complex in execution.
Mitigation: Track downstream failure rates by planner-assigned tier. If Haiku-assigned tasks fail at significantly higher rates than Sonnet-assigned tasks, the planner’s routing judgment is systematically miscalibrated. Feed this data back as context to the planner, or add explicit routing heuristics to the planner’s system prompt based on observed failure patterns.
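Tracking that miscalibration requires nothing more than a counter per tier. A sketch; the ten-point margin is an arbitrary starting point to tune against your own data:

```python
from collections import defaultdict

class TierCalibration:
    """Track downstream failure rates by planner-assigned tier."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # tier -> [attempts, failures]

    def record(self, tier: str, failed: bool) -> None:
        self.counts[tier][0] += 1
        self.counts[tier][1] += int(failed)

    def failure_rate(self, tier: str) -> float:
        attempts, failures = self.counts[tier]
        return failures / attempts if attempts else 0.0

    def miscalibrated(self, cheap: str = "haiku", mid: str = "sonnet",
                      margin: float = 0.10) -> bool:
        """Flag systematic under-routing: cheap-tier tasks failing far more often."""
        return self.failure_rate(cheap) > self.failure_rate(mid) + margin

cal = TierCalibration()
for failed in [True, True, False, False, False]:   # 40% failure on Haiku tasks
    cal.record("haiku", failed)
for failed in [False] * 9 + [True]:                # 10% failure on Sonnet tasks
    cal.record("sonnet", failed)
```

When `miscalibrated()` trips, the fix is the one described above: feed the observed failure pattern back into the planner's prompt, not just escalate more.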
When Not to Route
Routing adds complexity. There are contexts where that complexity is not worth the cost savings it unlocks.
Single-task, low-volume workflows: If your agent runs 20 tasks per day, the cost difference between tiers is measured in cents per day. The engineering time to build, test, and maintain a routing system costs orders of magnitude more than the savings over any reasonable payback period. Use a single, sufficiently capable model. Optimize when you have data and scale.
Uniformly high-stakes tasks: If every task in your pipeline could cause significant downstream harm if wrong, and the cost difference between Sonnet and Opus is small relative to the risk, don’t route. Accept the higher cost as the cost of reliability. Some contexts—legal review, security analysis, compliance checks—have failure costs that dwarf any plausible model cost savings.
Tasks with no reliable quality signal: Routing systems that include escalation paths need to know when escalation is warranted. If you can’t define a reliable quality check for executor output, you can’t build a reviewer that escalates appropriately. A routing system without a quality signal is a routing system that can’t catch its own failures. Better to use a single tier you trust.
Very early stage systems: Don’t optimize routing before you have data on where failures occur. Build the system. Run it. Observe where model limitations actually cause problems. Route based on evidence, not speculation. Premature routing architecture is expensive to maintain and may not address the actual failure modes.
The Verdict
Model routing pays off when three conditions are met: your fleet runs more than a few hundred tasks per day, your tasks are heterogeneous (mixing high-reasoning and low-reasoning work), and you have observability into task success rates by type.
At 500 tasks/day with an 80/15/5 split, routing saves ~$78/month versus all-Sonnet and ~$890/month versus all-Opus. At 5,000 tasks/day, multiply those figures by ten. The math is not subtle.
The minimum viable routing implementation is not a complex ML classifier. It’s a planner that explicitly assigns tiers to subtasks, a reviewer that flags low-confidence outputs for escalation, and a graduated escalation path (Haiku → Sonnet → Opus, not Haiku → Opus). That architecture, implemented well, captures the majority of the available cost reduction while adding a manageable amount of complexity.
The overhead cost of over-engineering routing is real. Building a learned routing model, training it on historical task data, deploying it as a separate service, and maintaining it is a significant engineering investment. Before that investment pays off, you need scale that most teams don’t have. Start with a rules-based planner. Upgrade when the rules break.
The research backs this up. FrugalGPT (arXiv:2305.05176) showed that even a simple cascade—try cheap first, escalate on low confidence—achieves near-maximum cost efficiency. MasRouter (ACL 2025) extended this to multi-agent settings, confirming that explicit routing by a capable coordinator outperforms implicit routing embedded in each agent’s own decision-making. The architecture conclusion is the same: put routing intelligence at the coordination layer, run cheap models at the execution layer, and verify at the review layer.
In a fleet running eight active agents, routing between model tiers is the difference between running out of budget and having capacity to take on more work. A fleet that uses Opus for everything burns through inference budget fast. A fleet that routes deliberately uses the same budget to run more tasks, more reliably, at lower per-task cost.
Route deliberately. Don’t hedge. The models you save Opus for will perform better when they’re not diluted by tasks a Haiku could handle in its sleep.
References
- Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176. Demonstrated that LLM cascade strategies—routing to cheaper models and escalating on low confidence—can reduce inference cost by up to 98% while matching the performance of the most capable model.
- MasRouter: Learning to Route LLMs for Multi-Agent System. ACL 2025. Addresses routing in multi-agent systems with explicit cost-performance tradeoffs, demonstrating that system-level routing by a capable coordinator outperforms distributed per-agent routing decisions.
- Ong, I., et al. (2024). A Unified Approach to Routing and Cascading for LLMs. arXiv:2410.10347. Formalizes the relationship between routing (selecting one model) and cascading (sequential escalation), providing a theoretical framework for optimal model selection strategies under cost constraints.