AI Agent Token Budget: How to Set and Enforce Spending Limits

I've watched agents burn through $40 of API credits on a single runaway task. Not a bug in the model — a missing spending limit. Token budgets are one of the most underrated controls in production agent systems, and most builders don't add them until something goes wrong.

When you run AI agents in production, cost is a first-class concern. An agent that loops, retries excessively, or spawns more subagents than expected can consume an order of magnitude more tokens than you planned. Without hard limits, your cloud bill becomes unpredictable and your system has no defense against runaway execution.

This post covers the patterns I actually use: per-session token caps, per-task cost limits, graceful degradation under budget pressure, and how to wire this up without making your agents brittle.

Why Token Budgets Are Different From Rate Limits

Rate limits protect the API provider. Token budgets protect you.

Rate limits are imposed externally — you hit a limit, the API returns a 429, and your agent has to back off. Token budgets are something you enforce internally, before the call even goes out. The distinction matters because by the time you've hit a rate limit, you've often already spent what you wanted to avoid spending.

A token budget is a first-party control. You're saying: this session, this task, or this agent instance gets at most N tokens. When it hits the limit, you decide what happens — graceful stop, fallback behavior, human handoff, or hard abort. That decision logic is yours to own.

Key insight: Token budgets belong in your agent's reasoning loop, not just in your billing dashboard. By the time your dashboard shows a spike, the damage is done. Budget enforcement at the code level gives you real-time control.

Setting Budgets: Per-Session, Per-Task, Per-Agent

There are three useful granularities for token budgets, and the right one depends on what you're protecting against.

Per-session budgets

A session budget caps total token spend across everything an agent does in one run. This is the most common pattern. I typically set a session budget as a multiple of the expected task cost — if a task normally uses ~10K tokens, I'll cap the session at 50K. That gives room for retries and branching without enabling infinite loops.

Track this with a simple counter. Before every LLM call, check if tokens_used + estimated_call_size > session_budget. If yes, stop and handle it. After every response, add the actual token count to your tracker.

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def check(self, estimated: int) -> bool:
        """Returns True if we have budget for this call."""
        return (self.used + estimated) <= self.limit

    def consume(self, tokens: int):
        self.used += tokens

    @property
    def remaining(self) -> int:
        return max(0, self.limit - self.used)

    @property
    def exhausted(self) -> bool:
        return self.used >= self.limit

Per-task budgets

When an agent handles multiple distinct tasks in a session, per-task budgets let you allocate differently based on task priority or complexity. A high-priority customer-facing task might get 30K tokens; a background indexing job gets 5K.

I implement this by passing a budget object into each task handler rather than sharing a global counter. Tasks can't borrow from each other, which keeps cost accounting clean.
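A minimal sketch of that pattern, reusing the `TokenBudget` class from earlier (the task names, limits, and `handle_task` helper here are illustrative, not from a real system):

```python
# Sketch: separate budgets per task so cost accounting stays clean.
# TokenBudget mirrors the class defined earlier in the post; task names
# and limits are illustrative.

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def check(self, estimated: int) -> bool:
        return (self.used + estimated) <= self.limit

    def consume(self, tokens: int):
        self.used += tokens


def handle_task(budget: TokenBudget, steps: list[int]) -> bool:
    """Run a task's steps, stopping cleanly if its own budget runs out."""
    for estimated in steps:
        if not budget.check(estimated):
            return False  # this task is out of budget; others are unaffected
        budget.consume(estimated)
    return True


# Allocate by priority: the customer-facing task gets more headroom.
budgets = {"support_reply": TokenBudget(30_000), "bg_indexing": TokenBudget(5_000)}
completed = handle_task(budgets["support_reply"], [4_000, 6_000, 3_000])
```

Because each handler only ever sees its own `TokenBudget` instance, a runaway task exhausts its own allocation and nothing else.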

Per-agent budgets in multi-agent systems

In a fleet where agents spawn subagents, per-agent limits prevent cascade spending. If an orchestrator spawns 10 workers and each can spend 20K tokens, your worst-case exposure is 200K tokens before any single task completes. Set per-agent limits that make sense for the worker's role, then track aggregate spend at the orchestrator level.

I run a system like this at Klyve. Each agent logs its token usage to a shared store, and the orchestrator checks fleet-wide spend before authorizing new spawns. If we're approaching the daily cap, the orchestrator stops spawning and routes remaining work to existing workers.
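The spawn gate can be sketched like this. This is a simplified model under my own assumptions: the shared store is a plain dict (in production it would be Redis or a database), and the cap and budget values are illustrative:

```python
# Sketch of an orchestrator gate: check fleet-wide spend against a daily
# cap before authorizing a new worker spawn. Values are illustrative.

DAILY_CAP = 2_000_000          # fleet-wide daily token cap
PER_WORKER_BUDGET = 20_000     # tokens reserved for each spawned worker

# Per-agent usage logged to a shared store (modeled here as a dict).
fleet_usage = {"worker-1": 450_000, "worker-2": 610_000}


def can_spawn_worker(usage: dict[str, int]) -> bool:
    """Authorize a spawn only if the worker's full budget fits under the cap."""
    committed = sum(usage.values()) + PER_WORKER_BUDGET
    return committed <= DAILY_CAP


# Near the cap, the orchestrator stops spawning and routes remaining
# work to existing workers instead.
if can_spawn_worker(fleet_usage):
    fleet_usage["worker-3"] = 0  # spawn and start tracking the new worker
```

Note that the gate reserves the worker's *full* budget up front, not its expected spend; that's what makes the cap a hard guarantee rather than a hope.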

Enforcement Patterns: What Happens When You Hit the Limit

Setting the limit itself is easy. The hard part is deciding what happens when you hit it. I've settled on three patterns depending on context:

Graceful stop with partial output

For long-running generation tasks — writing, analysis, summarization — the best response to a budget limit is to stop cleanly and return whatever's been produced so far. Partial output is often more useful than no output. The caller can decide whether to continue with a fresh session or use the partial result.

The key is making this explicit in the output. Return a flag like budget_exhausted: true along with a partial: true marker. Callers that don't check for this will silently treat partial results as complete — which is worse than getting an error.
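Here's a sketch of that shape. The `generate_chunk` stub stands in for a real LLM call, and the per-chunk cost is a made-up constant:

```python
# Sketch: return partial output with explicit flags when budget runs out,
# so callers can't silently mistake a partial result for a complete one.

def generate_chunk(step: int) -> tuple[str, int]:
    """Stand-in for one LLM call: returns (text, tokens_used)."""
    return (f"section {step}. ", 4_000)


def run_generation(session_budget: int, total_steps: int) -> dict:
    used, output = 0, []
    for step in range(total_steps):
        text, cost = generate_chunk(step)
        if used + cost > session_budget:
            # Stop cleanly and mark the result so callers must handle it.
            return {"output": "".join(output),
                    "partial": True,
                    "budget_exhausted": True}
        output.append(text)
        used += cost
    return {"output": "".join(output), "partial": False, "budget_exhausted": False}


result = run_generation(session_budget=10_000, total_steps=5)
```

A caller that checks `result["partial"]` can decide whether to resume with a fresh budget or ship what it has.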

Fallback to cheaper path

Before hard-stopping, try to route to a cheaper option. If you're running an expensive reasoning model (Opus-class) and hit 80% of budget, switch remaining steps to a cheaper model (Haiku-class) for lower-value operations like formatting, extraction, or simple lookups. This is the most cost-effective pattern when tasks have mixed-value steps.

This requires knowing your task structure well enough to label which steps are high-value and which aren't. It's worth the upfront design work. In practice, 60–70% of agent token spend goes to a small number of expensive reasoning calls; the rest is scaffolding you can run cheap.
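A minimal routing function for this, assuming steps are pre-labeled high- or low-value (model names and the step plan are illustrative):

```python
# Sketch of model tiering under budget pressure: past 80% of budget,
# even high-value steps route to the cheaper model; low-value
# scaffolding (formatting, extraction) always runs cheap.

EXPENSIVE, CHEAP = "opus-class", "haiku-class"


def pick_model(step_value: str, used: int, limit: int) -> str:
    """Route a step to a model tier based on its value and budget spent."""
    if step_value == "low":
        return CHEAP
    return CHEAP if used >= 0.8 * limit else EXPENSIVE


# A mixed-value task plan: expensive reasoning vs. cheap scaffolding.
plan = [("reason", "high"), ("extract", "low"), ("reason", "high"), ("format", "low")]
limit, used = 100_000, 85_000  # already past the 80% threshold
routing = [(name, pick_model(value, used, limit)) for name, value in plan]
```
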

Human handoff

For irreversible or high-stakes tasks — code deployments, external API writes, financial operations — hitting a budget limit should trigger a human handoff rather than a graceful stop or fallback. The agent should checkpoint its state, explain where it got to, and wait. Don't try to infer the right next step when you're running low on budget in a high-risk context.
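A sketch of the checkpoint-and-wait step. The record shape, the `HANDOFF` status string, and the example task are all my own assumptions about what such a checkpoint might contain:

```python
# Sketch: on budget exhaustion in a high-stakes task, checkpoint state
# and emit a handoff record for a human instead of guessing the next step.

import json
import time


def checkpoint_and_handoff(task_id: str, state: dict, progress_note: str) -> dict:
    """Freeze agent state and produce a handoff record for a human operator."""
    record = {
        "task_id": task_id,
        "status": "HANDOFF",           # no further autonomous steps
        "checkpointed_at": time.time(),
        "state": state,                # enough to resume from exactly here
        "progress": progress_note,     # plain-language "where I got to"
    }
    # In production this would go to a queue or ticket system; here we
    # round-trip through JSON to confirm the record is serializable.
    return json.loads(json.dumps(record))


handoff = checkpoint_and_handoff(
    "deploy-42",
    {"step": "pre-deploy-checks", "checks_passed": 7},
    "Validated config and passed 7/9 pre-deploy checks before budget ran out.",
)
```
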

Pattern to avoid: Silently continuing past budget. Some teams set soft limits and let agents keep running with a warning log. In practice, the warning gets ignored and the pattern trains you to treat limits as suggestions. Hard stops or explicit fallbacks only.

Estimating Costs Before They Happen

Reactive enforcement (stop when you've hit the limit) is necessary, but proactive estimation (predict whether you'll exceed budget before starting) is more valuable.

I do this at two levels:

Task-level estimation: Before kicking off a task, estimate how many tokens it will use based on task type, input size, and expected output length. This estimate doesn't need to be precise — order-of-magnitude accuracy is enough to decide whether to proceed, queue, or refuse.

Step-level preflight check: Before each LLM call within a task, check if remaining budget is sufficient for the call plus a safety margin. The safety margin should account for unexpectedly long model outputs. I use 2x the estimated output tokens as a margin.

For the preflight check to work, you need a rough model of how many tokens each step type consumes. Build this from real usage data — log every call with its token counts and tag it by step type. After a week of production data, you'll have solid priors.
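Putting the priors and the margin together, the preflight check is a few lines. The step types and prior values below are illustrative stand-ins for what your own logs would produce:

```python
# Sketch of a step-level preflight check: per-step-type token priors
# (learned from logged production data) plus a 2x output-token safety
# margin, matching the margin described above.

STEP_PRIORS = {  # median tokens per step type, from call logs
    "summarize": {"input": 3_000, "output": 800},
    "extract":   {"input": 1_200, "output": 300},
}


def preflight_ok(step_type: str, remaining_budget: int) -> bool:
    """Allow the call only if input plus 2x expected output fits the budget."""
    prior = STEP_PRIORS[step_type]
    worst_case = prior["input"] + 2 * prior["output"]  # 2x output margin
    return worst_case <= remaining_budget
```

Run this gate before every call; if it fails, hand off to whichever enforcement pattern the task uses.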

Wiring Budget State Into Agent Prompts

One pattern that works surprisingly well: include remaining budget information in the agent's system prompt. Something like:

You have approximately 8,000 tokens remaining in this session budget.
Prioritize completing the core task. Skip optional analysis steps if needed.
If you cannot complete the task within budget, stop and report your progress.

This lets the model self-regulate. When it knows budget is tight, it tends to be more concise and skip lower-value elaboration on its own. It's not a substitute for hard enforcement — you still need code-level limits — but it meaningfully reduces the frequency of hitting hard limits.

Update this prompt injection dynamically as the session progresses. At 75% budget consumed, switch from "you have X tokens remaining" to "budget is tight — be concise." At 90%, make it explicit that the task should wrap up.
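The escalating injection can be generated from the same budget counter. The thresholds match the ones above; the exact wording is illustrative:

```python
# Sketch: build the budget line injected into the system prompt, with
# wording that escalates at 75% and 90% of budget consumed.

def budget_prompt(used: int, limit: int) -> str:
    """Return the budget-awareness line for the agent's system prompt."""
    remaining = max(0, limit - used)
    frac = used / limit
    if frac >= 0.9:
        return "Budget nearly exhausted — wrap up the task now and report progress."
    if frac >= 0.75:
        return "Budget is tight — be concise and skip optional steps."
    return (f"You have approximately {remaining:,} tokens remaining in this "
            f"session budget. Prioritize completing the core task.")
```

Regenerate this line before each call so the model always sees current budget state, not the state at session start.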

Monitoring and Alerting

Budget enforcement in code is table stakes. You also need visibility at the fleet level:

I use WatchDog (watch.klyve.xyz) to monitor these metrics across my agent fleet. Each agent reports token usage per session, and WatchDog flags anomalies — sessions that exceed 2x the rolling average spend get an immediate alert. That's how I caught a subagent loop last month before it compounded into a real cost spike.
