LLM Tool Calling Patterns for Production Agents: 5 That Work

Tool calling is where agents meet reality. The LLM decides what to do, but tool calls are what actually does it — reads the file, hits the API, writes the code. After running an autonomous agent for hundreds of sessions, we've found that most tool-calling failures aren't about the tools themselves. They're about the patterns around how tools get called.

The Five Patterns We Actually Use

There are endless ways to wire tools into an agent. Most of the academic literature focuses on tool selection — getting the model to pick the right tool. That's table stakes. The real challenge is the orchestration: when to call tools in parallel, how to handle failures, and when to stop retrying.

Here are five patterns we rely on daily in our production agent loop:

1. Parallel Fan-Out for Independent Reads

When an agent needs to gather information from multiple independent sources, calling tools sequentially wastes time. If you need to read three files, check a service status, and pull analytics data — and none of these depend on each other — call them all at once.

# Instead of:
read file_a → read file_b → read file_c → check_status

Do:

[read file_a, read file_b, read file_c, check_status] # parallel → then proceed with all results

This seems obvious, but many agent frameworks serialize everything by default. The cost is real: in a 30-minute session, sequential reads can burn 3-5 minutes just waiting for round trips. Parallel fan-out cuts that to the latency of the slowest single call.

When it breaks: When you think calls are independent but they're not. Reading a config file and then reading a path specified in that config — those are sequential. The failure mode is calling both in parallel and getting an error on the second because you guessed the path wrong.
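A minimal sketch of the fan-out, using asyncio with hypothetical read_file and check_status coroutines standing in for real tool calls:

```python
import asyncio

# Hypothetical async tool wrappers; substitute your agent's real tool calls.
async def read_file(path):
    await asyncio.sleep(0.01)  # stand-in for round-trip latency
    return f"contents of {path}"

async def check_status(service):
    await asyncio.sleep(0.01)
    return f"{service}: ok"

async def gather_context():
    # Independent reads: issue them all at once, wait for the slowest.
    return await asyncio.gather(
        read_file("file_a"),
        read_file("file_b"),
        read_file("file_c"),
        check_status("analytics"),
    )

results = asyncio.run(gather_context())
```

Total wall time here is roughly one call's latency, not four.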

2. Validate-Then-Act (Never Trust Your Own Output)

After any write operation — creating a file, deploying a service, updating a database — immediately validate with a separate read. Don't trust the tool's success response alone.

# Bad:
write_file("config.json", data) → "Success" → move on

Good:

write_file("config.json", data) → "Success" → read_file("config.json") → verify contents match → then move on

This pattern catches silent failures that are surprisingly common: partial writes, permission issues that return success but don't persist, race conditions with other processes. We learned this the hard way when our agent "successfully" deployed a config that was actually empty — the write returned 200 but the disk was full.

The overhead is one extra tool call per write. The alternative is discovering the failure three steps later when something downstream breaks, and then spending ten tool calls debugging.
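As a concrete sketch, here is validate-then-act for a JSON config write; the write_and_verify helper is an illustration, not a real agent API:

```python
import json
import os
import tempfile

def write_and_verify(path, data):
    """Write JSON, then read it back and compare. Don't trust the write alone."""
    with open(path, "w") as f:
        json.dump(data, f)
    # Validation read: a separate call, so silent partial writes get caught.
    with open(path) as f:
        return json.load(f) == data

path = os.path.join(tempfile.mkdtemp(), "config.json")
ok = write_and_verify(path, {"service": "api", "port": 8080})
```

If the read-back doesn't match, the agent knows immediately, not three steps later.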

3. Fallback Chains with Escalation

When a tool call fails, don't retry the same call blindly. Instead, define a fallback chain: try the primary approach, then a simpler alternative, then escalate to a human or skip entirely.

# Fallback chain for getting data:
1. Try: API call to preferred source
2. Fallback: API call to alternative source
3. Fallback: Read cached data from disk
4. Escalate: Flag as "data unavailable" and ask for help

The key insight is that step 4 exists. Many agents get stuck in retry loops because they don't have an explicit "give up and escalate" option. An agent that retries the same failing API call five times isn't persistent — it's wasting tokens.

In our system, we set a hard limit: three attempts on any single operation, then move to the fallback. If all fallbacks fail, the agent logs the failure and sends a message asking for human intervention. The cost of asking for help is one message. The cost of an infinite retry loop is an entire session burned.
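The chain above can be sketched as a loop over ordered steps; primary_api, backup_api, and read_cache are hypothetical stand-ins (the first simulates an outage):

```python
def primary_api():
    raise ConnectionError("primary source down")  # simulated outage

def backup_api():
    return {"source": "backup", "rows": 42}

def read_cache():
    return {"source": "cache", "rows": 40}

def get_data():
    # Ordered fallback chain: first success wins, three tries per step.
    for step in (primary_api, backup_api, read_cache):
        for _ in range(3):
            try:
                return step()
            except Exception:
                continue
    # Step 4: explicit escalation instead of an infinite retry loop.
    raise RuntimeError("data unavailable, ask a human for help")

result = get_data()
```

The final raise is the point: when every fallback fails, the agent surfaces the failure rather than spinning.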

4. Tool Call Budgeting

Every tool call costs tokens (for the request/response), time (for the round trip), and context window space (for the result). Without budgets, agents drift into expensive spirals — especially on search and exploration tasks.

We enforce soft budgets per task:

- Max 5 searches per topic
- Max 10 file reads per exploration task
- Max 3 attempts on any single failing operation

These aren't hard limits enforced by code (though they could be). They're instructions in the agent's prompt that work because the model respects them. The trick is being specific — "don't use too many tool calls" doesn't work. "Max 5 searches per topic" does.
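If you do want code-enforced budgets, a small counter works; ToolBudget and its limits are an illustrative sketch, not our production implementation:

```python
class ToolBudget:
    """Soft per-task budget: count calls per tool and flag overruns."""

    def __init__(self, limits):
        self.limits = limits      # e.g. {"search": 5, "read_file": 10}
        self.used = {}

    def allow(self, tool):
        # Record the call, then check it against the tool's limit.
        self.used[tool] = self.used.get(tool, 0) + 1
        return self.used[tool] <= self.limits.get(tool, float("inf"))

budget = ToolBudget({"search": 5, "read_file": 10})
calls = [budget.allow("search") for _ in range(6)]
```

The first five searches pass; the sixth trips the budget and can trigger a "synthesize what you have" step.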

5. State Checkpointing Between Tool Chains

For multi-step operations (deploy a service, then verify, then update config, then test), write intermediate state to disk between steps. If the agent's session crashes or the context window fills up, the next session can resume from the last checkpoint instead of starting over.

# Before each major step:
write_checkpoint("deploy-step-3", {
  service: "deployed",
  config: "pending",
  tests: "pending"
})

At session start:

checkpoint = read_checkpoint("deploy")
if checkpoint.config == "pending": resume_from("config_update")

This pattern is essential for long-horizon tasks that span multiple sessions. Without it, the agent re-does work it already completed, which isn't just slow — it can be destructive if the "redo" conflicts with state from the first run.

We implement this with a simple JSON file that the agent reads at the start of every session. Each step writes its status. The overhead is minimal; the recovery benefit is enormous.
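A minimal version of that JSON-file checkpoint, with hypothetical helper names:

```python
import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.mkdtemp(), "deploy.json")

def write_checkpoint(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def read_checkpoint():
    if not os.path.exists(CHECKPOINT):
        return None
    with open(CHECKPOINT) as f:
        return json.load(f)

# Session 1: the deploy step completes, then the session dies.
write_checkpoint({"service": "deployed", "config": "pending", "tests": "pending"})

# Session 2: read the checkpoint and resume at the first pending step.
state = read_checkpoint()
resume_at = next(step for step, status in state.items() if status == "pending")
```

The second session picks up at "config" instead of redeploying a service that's already live.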

The Anti-Patterns

Equally important is knowing what not to do:

- Retrying the same failing call with the same parameters, over and over
- Running calls in parallel that secretly depend on each other's results
- Trusting a tool's success response without a validation read
- Exploring without a budget until unused results fill the context window
- Running long multi-step operations with no checkpoints, so a crash means starting over

Measuring Tool Calling Quality

We track three metrics on our agent's tool usage:

  1. Tool calls per task: Are we getting more efficient over time, or drifting toward more calls?
  2. Error rate: What percentage of tool calls fail? If it's above 10%, something is wrong with our tool descriptions or the agent's understanding of when to use them.
  3. Wasted calls: Tool calls whose results were never used in the final output. These are pure waste — the agent explored a path and abandoned it.

Tracking these over time reveals patterns: maybe the agent always wastes calls when doing file exploration (solution: better glob patterns). Maybe error rates spike on a particular API (solution: add retry logic or a fallback). The data tells you where to invest in better tooling.
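The three metrics fall out of a per-call log; the {tool, ok, used} record shape here is an assumed schema, not a fixed format:

```python
def tool_metrics(log):
    """Compute calls, error rate, and wasted-call rate from call records."""
    total = len(log)
    errors = sum(1 for c in log if not c["ok"])
    # Wasted: the call succeeded but its result never reached the output.
    wasted = sum(1 for c in log if c["ok"] and not c["used"])
    return {
        "calls": total,
        "error_rate": errors / total,
        "wasted_rate": wasted / total,
    }

log = [
    {"tool": "search", "ok": True, "used": True},
    {"tool": "search", "ok": True, "used": False},   # result never used
    {"tool": "read_file", "ok": False, "used": False},
    {"tool": "read_file", "ok": True, "used": True},
]
metrics = tool_metrics(log)
```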

The Bigger Picture

Tool calling is the execution layer of an agent. Better tools matter, but better patterns for using tools matter more. A well-designed fallback chain with budget limits will outperform a collection of perfect tools called haphazardly.

The patterns above aren't theoretical — they're running in production right now, handling hundreds of tool calls per day across file operations, web searches, API calls, and system commands. They work because they're simple, explicit, and they assume failure is normal.

Start with validate-then-act and fallback chains. Those two patterns alone will eliminate most of the frustrating "the agent did something but it didn't actually work" failures that plague autonomous systems.

Frequently Asked Questions

Q: How do you decide which tool calls to run in parallel vs. sequentially?

The rule is simple: if call B needs data from call A's result, they must be sequential. Everything else can be parallel. In practice, reads (files, APIs, searches) are almost always parallelizable, while writes that depend on reads are sequential. When in doubt, run sequentially — incorrect parallel calls cause harder-to-debug errors than the time saved.

Q: What's the right retry limit for failed tool calls?

Three attempts maximum for the same operation with the same parameters. After that, either change the approach (different parameters, different tool) or escalate. The key is distinguishing transient failures (network timeout — retry makes sense) from permanent failures (wrong file path — retrying is pointless). If the error message is the same on attempt two, don't attempt three.
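The "same error twice means stop" rule can be sketched as a retry wrapper; this is an illustration of the heuristic, not a drop-in library:

```python
def retry(op, max_attempts=3):
    """Retry an operation, but stop early if the identical error repeats."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return op()
        except Exception as e:
            if last_err is not None and str(e) == str(last_err):
                break  # same error twice: likely permanent, stop retrying
            last_err = e
    raise last_err

attempts = []
def always_missing():
    # A permanent failure: the path is simply wrong.
    attempts.append(1)
    raise FileNotFoundError("no such file: config.json")

try:
    retry(always_missing)
except FileNotFoundError:
    pass
```

Here the wrapper gives up after two attempts instead of three, because the second error message matched the first.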

Q: How do you prevent agents from burning tokens on unnecessary tool calls?

Explicit budgets in the prompt work surprisingly well. Instead of vague "be efficient" instructions, specify concrete limits: "max 5 searches per topic," "max 10 file reads per exploration task." Also track wasted calls (calls whose results aren't used) — high waste rates indicate the agent is exploring too broadly before synthesizing.

Q: Should tool descriptions be detailed or minimal?

Detailed, but not verbose. Each tool description should include: what the tool does, when to use it (and when NOT to use it), what parameters mean, and what the output looks like. The "when not to use it" part is critical — it prevents the model from reaching for a powerful tool when a simpler one would do. Aim for 3-5 sentences per tool, not 3-5 paragraphs.

Q: How do checkpoints work across agent sessions that share no memory?

Write checkpoint state to persistent storage — a JSON file on disk, a database row, or an external key-value store. At session start, the agent reads the checkpoint file before doing anything else. The checkpoint contains the last completed step and any intermediate data needed to resume. This is the same pattern as database transaction logs, just applied to agent workflows.
