When you run AI agents in production, cost is a first-class concern. An agent that loops, retries excessively, or spawns more subagents than expected can consume an order of magnitude more tokens than you planned. Without hard limits, your cloud bill becomes unpredictable and your system has no defense against runaway execution.
This post covers the patterns I actually use: per-session token caps, per-task cost limits, graceful degradation under budget pressure, and how to wire this up without making your agents brittle.
Why Token Budgets Are Different From Rate Limits
Rate limits protect the API provider. Token budgets protect you.
Rate limits are imposed externally — you hit a limit, the API returns a 429, and your agent has to back off. Token budgets are something you enforce internally, before the call even goes out. The distinction matters because by the time you've hit a rate limit, you've often already spent what you wanted to avoid spending.
A token budget is a first-party control. You're saying: this session, this task, or this agent instance gets at most N tokens. When it hits the limit, you decide what happens — graceful stop, fallback behavior, human handoff, or hard abort. That decision logic is yours to own.
Key insight: Token budgets belong in your agent's reasoning loop, not just in your billing dashboard. By the time your dashboard shows a spike, the damage is done. Budget enforcement at the code level gives you real-time control.
Setting Budgets: Per-Session, Per-Task, Per-Agent
There are three useful granularities for token budgets, and the right one depends on what you're protecting against.
Per-session budgets
A session budget caps total token spend across everything an agent does in one run. This is the most common pattern. I typically set a session budget as a multiple of the expected task cost — if a task normally uses ~10K tokens, I'll cap the session at 50K. That gives room for retries and branching without enabling infinite loops.
Track this with a simple counter. Before every LLM call, check if tokens_used + estimated_call_size > session_budget. If yes, stop and handle it. After every response, add the actual token count to your tracker.
```python
class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def check(self, estimated: int) -> bool:
        """Return True if we have budget for this call."""
        return (self.used + estimated) <= self.limit

    def consume(self, tokens: int) -> None:
        self.used += tokens

    @property
    def remaining(self) -> int:
        return max(0, self.limit - self.used)

    @property
    def exhausted(self) -> bool:
        return self.used >= self.limit
```
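The check/consume loop around that class, sketched end to end. `fake_llm_call` and the token numbers are illustrative stand-ins for a real client and estimator; plain counters are used so the snippet stands alone:

```python
# Preflight-then-consume loop. `fake_llm_call` and the cost numbers are
# illustrative stand-ins for a real client and estimator.
def fake_llm_call(prompt: str) -> tuple[str, int]:
    return "response text", 1_200    # (output, actual tokens used)

SESSION_LIMIT = 5_000
ESTIMATED_CALL_SIZE = 1_500          # per-call estimate plus safety margin

used = 0
results = []
for prompt in ["step 1", "step 2", "step 3", "step 4"]:
    if used + ESTIMATED_CALL_SIZE > SESSION_LIMIT:
        break                        # budget exhausted: stop and handle it
    text, actual = fake_llm_call(prompt)
    used += actual                   # record actual spend, not the estimate
    results.append(text)
```

Note that the check uses the estimate but the counter records the actual token count from the response; the two diverge, which is exactly why the safety margin exists.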
Per-task budgets
When an agent handles multiple distinct tasks in a session, per-task budgets let you allocate differently based on task priority or complexity. A high-priority customer-facing task might get 30K tokens; a background indexing job gets 5K.
I implement this by passing a budget object into each task handler rather than sharing a global counter. Tasks can't borrow from each other, which keeps cost accounting clean.
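A sketch of that allocation, assuming a `TASK_BUDGETS` table keyed on task type. The tiers, numbers, and fallback default are all illustrative:

```python
class TokenBudget:                    # minimal version of the class above
    def __init__(self, limit: int):
        self.limit, self.used = limit, 0

# Illustrative allocation table; tune per task priority and complexity.
TASK_BUDGETS = {
    "customer_facing": 30_000,
    "background_index": 5_000,
}

def make_task_budget(task_type: str) -> TokenBudget:
    # Each handler gets its own budget object, so tasks cannot borrow
    # from each other and cost accounting stays clean. Unknown task
    # types fall back to an arbitrary default here.
    return TokenBudget(TASK_BUDGETS.get(task_type, 10_000))
```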
Per-agent budgets in multi-agent systems
In a fleet where agents spawn subagents, per-agent limits prevent cascade spending. If an orchestrator spawns 10 workers and each can spend 20K tokens, you have 200K tokens of worst-case exposure before any single task completes. Set per-agent limits that make sense for the worker's role, then track aggregate spend at the orchestrator level.
I run a system like this at Klyve. Each agent logs its token usage to a shared store, and the orchestrator checks fleet-wide spend before authorizing new spawns. If we're approaching the daily cap, the orchestrator stops spawning and routes remaining work to existing workers.
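A sketch of that orchestrator-side gate. An in-process dict stands in for the shared store here, and the cap and per-agent numbers are illustrative:

```python
# Orchestrator-side spawn gating under a fleet-wide daily cap. In
# production the usage map would live in a shared store (Redis, a
# database) that every agent writes to; a dict stands in here.
DAILY_CAP = 2_000_000
PER_AGENT_LIMIT = 20_000

usage_by_agent: dict[str, int] = {}

def fleet_spend() -> int:
    return sum(usage_by_agent.values())

def authorize_spawn() -> bool:
    # Refuse to spawn if a new worker's full allocation could push the
    # fleet past the cap; route remaining work to existing workers.
    return fleet_spend() + PER_AGENT_LIMIT <= DAILY_CAP
```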
Enforcement Patterns: What Happens When You Hit the Limit
Setting the budget limit is the easy part. The hard part is deciding what happens when you hit it. I've settled on three patterns, depending on context:
Graceful stop with partial output
For long-running generation tasks — writing, analysis, summarization — the best response to a budget limit is to stop cleanly and return whatever's been produced so far. Partial output is often more useful than no output. The caller can decide whether to continue with a fresh session or use the partial result.
The key is making this explicit in the output. Return a flag like budget_exhausted: true along with a partial: true marker. Callers that don't check for this will silently treat partial results as complete — which is worse than getting an error.
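A minimal shape for that result envelope. The field names follow the flags above; everything else is illustrative:

```python
def finish(output: str, budget_exhausted: bool) -> dict:
    # Explicit flags: callers must check these rather than assuming
    # every result is complete.
    return {
        "output": output,
        "budget_exhausted": budget_exhausted,
        "partial": budget_exhausted,   # partial iff we stopped on budget
    }
```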
Fallback to cheaper path
Before hard-stopping, try to route to a cheaper option. If you're running an expensive reasoning model (Opus-class) and hit 80% of budget, switch remaining steps to a cheaper model (Haiku-class) for lower-value operations like formatting, extraction, or simple lookups. This is the most cost-effective pattern when tasks have mixed-value steps.
This requires knowing your task structure well enough to label which steps are high-value and which aren't. It's worth the upfront design work. In practice, 60–70% of agent token spend goes to a small number of expensive reasoning calls; the rest is scaffolding you can run cheap.
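A sketch of per-step model routing under that rule. The tier names and step labels are placeholders for whatever your task structure defines:

```python
EXPENSIVE, CHEAP = "opus-class", "haiku-class"   # placeholder tier names
HIGH_VALUE_STEPS = {"plan", "reason", "review"}  # labeled per task, upfront

def pick_model(step: str, used: int, limit: int) -> str:
    # Past 80% of budget, low-value steps (formatting, extraction,
    # lookups) drop to the cheap tier; high-value reasoning stays put.
    if used >= 0.8 * limit and step not in HIGH_VALUE_STEPS:
        return CHEAP
    return EXPENSIVE
```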
Human handoff
For irreversible or high-stakes tasks — code deployments, external API writes, financial operations — hitting a budget limit should trigger a human handoff rather than a graceful stop or fallback. The agent should checkpoint its state, explain where it got to, and wait. Don't try to infer the right next step when you're running low on budget in a high-risk context.
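A sketch of the checkpoint payload. The fields and the `awaiting_human` status are illustrative, not a fixed schema:

```python
import json
import time

def checkpoint_for_handoff(task_id: str, state: dict, progress: str) -> str:
    # Persist enough to resume exactly where the agent stopped, plus a
    # human-readable account of where it got to, then wait.
    return json.dumps({
        "task_id": task_id,
        "state": state,
        "progress": progress,
        "status": "awaiting_human",
        "checkpointed_at": time.time(),
    })
```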
Pattern to avoid: Silently continuing past budget. Some teams set soft limits and let agents keep running with a warning log. In practice, the warning gets ignored and the pattern trains you to treat limits as suggestions. Hard stops or explicit fallbacks only.
Estimating Costs Before They Happen
Reactive enforcement (stop when you've hit the limit) is necessary, but proactive estimation (predict whether you'll exceed budget before starting) is more valuable.
I do this at two levels:
Task-level estimation: Before kicking off a task, estimate how many tokens it will use based on task type, input size, and expected output length. This estimate doesn't need to be precise — order-of-magnitude accuracy is enough to decide whether to proceed, queue, or refuse.
Step-level preflight check: Before each LLM call within a task, check if remaining budget is sufficient for the call plus a safety margin. The safety margin should account for unexpectedly long model outputs. I use 2x the estimated output tokens as a margin.
For the preflight check to work, you need a rough model of how many tokens each step type consumes. Build this from real usage data — log every call with its token counts and tag it by step type. After a week of production data, you'll have solid priors.
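With those priors in hand, the preflight check with the 2x output margin looks roughly like this. The prior values are illustrative; yours come from your own logged usage:

```python
STEP_OUTPUT_PRIORS = {          # expected output tokens by step type,
    "extract": 300,             # built from logged production usage
    "summarize": 1_200,
    "reason": 3_000,
}

def preflight_ok(step_type: str, input_tokens: int, remaining: int) -> bool:
    # Proceed only if remaining budget covers the input plus 2x the
    # expected output, the safety margin for unexpectedly long outputs.
    expected_output = STEP_OUTPUT_PRIORS.get(step_type, 1_000)
    needed = input_tokens + 2 * expected_output
    return needed <= remaining
```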
Wiring Budget State Into Agent Prompts
One pattern that works surprisingly well: include remaining budget information in the agent's system prompt. Something like:
```
You have approximately 8,000 tokens remaining in this session budget.
Prioritize completing the core task. Skip optional analysis steps if needed.
If you cannot complete the task within budget, stop and report your progress.
```
This lets the model self-regulate. When it knows budget is tight, it tends to be more concise and skip lower-value elaboration on its own. It's not a substitute for hard enforcement — you still need code-level limits — but it meaningfully reduces the frequency of hitting hard limits.
Update this prompt injection dynamically as the session progresses. At 75% budget consumed, switch from "you have X tokens remaining" to "budget is tight — be concise." At 90%, make it explicit that the task should wrap up.
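That schedule can be sketched as a small function. The thresholds follow the 75%/90% schedule above; the wording is illustrative:

```python
def budget_prompt(used: int, limit: int) -> str:
    # Escalating injection: exact count -> "be concise" -> "wrap up".
    remaining = max(0, limit - used)
    frac = used / limit
    if frac >= 0.9:
        return "Budget is nearly exhausted. Wrap up the task now and report progress."
    if frac >= 0.75:
        return "Budget is tight. Be concise and skip optional analysis steps."
    return (
        f"You have approximately {remaining:,} tokens remaining in this "
        "session budget. Prioritize completing the core task."
    )
```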
Monitoring and Alerting
Budget enforcement in code is table stakes. You also need visibility at the fleet level:
- Per-session cost distribution: Track p50, p90, p99 token spend per session type. Sudden shifts in p90 often indicate a new failure mode before it shows up in p50.
- Budget exhaustion rate: What fraction of sessions hit their budget limit? If it's above 5%, your limits are too tight or your tasks are expanding. Either is worth investigating.
- Cost per outcome: The metric that matters most isn't total tokens spent — it's tokens per successful task completion. An agent that uses 20K tokens and succeeds is better than one that uses 8K and fails half the time.
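All three can be computed from raw per-session spend logs. A sketch using only the standard library, with a 2x-rolling-average anomaly rule as the alert threshold:

```python
from statistics import mean, quantiles

def fleet_metrics(spends: list[int], budget_limit: int) -> dict:
    # Percentiles of per-session spend, plus budget exhaustion rate.
    qs = quantiles(spends, n=100, method="inclusive")
    return {
        "p50": qs[49],
        "p90": qs[89],
        "p99": qs[98],
        "exhaustion_rate": sum(s >= budget_limit for s in spends) / len(spends),
    }

def is_anomalous(spend: int, recent_spends: list[int]) -> bool:
    # Flag sessions above 2x the rolling average spend.
    return spend > 2 * mean(recent_spends)
```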
I use WatchDog (watch.klyve.xyz) to monitor these metrics across my agent fleet. Each agent reports token usage per session, and WatchDog flags anomalies — sessions that exceed 2x the rolling average spend get an immediate alert. That's how I caught a subagent loop last month before it compounded into a real cost spike.