AI Agent Tool Calling Not Working: A Debug Guide

Tool calling failures are the most common class of agent bug I fix. The agent has the right tools. The tools work when called directly. But the agent either doesn't call them, calls them wrong, or calls them on a loop. Here's how I debug it — in order, with concrete examples of what I actually changed.

I've been running this agent autonomously for months. Tool calling is the surface where most failures happen. After seeing the same patterns break in different contexts, I've developed a repeatable debug sequence. This isn't theory — every step here maps to a real failure I've seen.

Why Tool Calling Fails (The Short Version)

Tool calling failures fall into three categories:

  1. The model doesn't call the tool at all — it answers from memory or decides the tool isn't needed
  2. The model calls the tool with wrong parameters — wrong types, missing fields, hallucinated values
  3. The tool call succeeds but the agent ignores or misuses the result — it reads the output wrong or doesn't act on it

Most developers assume the problem is in their tool implementation. It almost never is. The tool works fine. The problem is in the description, the parameter schema, or the surrounding prompt context that the model uses to decide when and how to call it.

Step 1: Check Whether the Model Is Even Seeing the Tool

Before debugging descriptions or parameter types, verify the tool is actually being passed to the model. This sounds obvious, but I've hit it. Frameworks that load tools dynamically sometimes silently drop tools that fail validation, have duplicate names, or exceed context limits.

Log the full tools array before each API call. Confirm your tool appears. If you're passing 20+ tools, also check the total token count — large tool schemas consume context budget, and some models degrade badly when the tools block itself is more than a few thousand tokens.
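A minimal sketch of that pre-flight audit — `audit_tools` is a hypothetical helper name, and the ~4 characters-per-token estimate is a rough rule of thumb, not an exact tokenizer:

```python
import json

def audit_tools(tools):
    """Sanity-check a tools array before sending it to the model.

    Flags duplicate names (which some frameworks silently drop) and
    gives a rough token estimate for the serialized schemas,
    assuming ~4 characters per token.
    """
    names = [t["name"] for t in tools]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    approx_tokens = len(json.dumps(tools)) // 4
    return {
        "count": len(tools),
        "duplicates": duplicates,
        "approx_tokens": approx_tokens,
    }
```

Run this right before the API call and log the result; a duplicate name or a surprisingly large token estimate points straight at Step 1 problems.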

Quick check: Remove all tools except the one you're debugging and test again. If it works, the problem is tool count or a conflict between tools. Add them back one by one until it breaks.

Step 2: Audit the Tool Description

Tool descriptions are the model's only guide for when and how to use a tool. Most descriptions I've seen in the wild are too vague. Here's a real before/after from my own tooling:

Before (broken):

{
  "name": "search_files",
  "description": "Search for files."
}

After (working):

{
  "name": "search_files",
  "description": "Search for files by name pattern using glob syntax. Use this when you need to find files whose names match a pattern (e.g., '*.json', 'config*'). Do NOT use this to search file contents — use grep_files for that. Returns a list of matching file paths sorted by modification time."
}

The critical elements I added: what it does, when to use it, when NOT to use it, what the input syntax looks like, and what the output is. That last point — the negative case — is what fixed my failure. The model was calling search_files when it should have been calling grep_files, because nothing told it not to.

Step 3: Validate Your Parameter Schema

Parameter schema issues are the second most common cause of tool calling failures. The model will try to fill parameters based on their names and descriptions. If names are ambiguous or types are wrong, you'll get malformed calls.

The things that consistently cause problems: ambiguous parameter names, wrong or overly loose types, and string formats that the description never pins down.

Concrete example: I had a tool that took a date parameter as a string. The schema said "type": "string". The model kept passing natural language like "yesterday" or "last Monday." Once I changed the description to say "ISO 8601 date string, e.g. '2026-03-05'", the failures stopped.
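Here's what that fix looks like in schema form, plus a guard you can run on the model's arguments before the tool executes. The `pattern` field is standard JSON Schema; the parameter and helper names are illustrative:

```python
import re

# Corrected schema fragment for the date parameter from the example.
# The description now states the exact format with an example, and
# `pattern` lets you (or a validating framework) reject bad values.
date_param = {
    "type": "string",
    "description": "ISO 8601 date string, e.g. '2026-03-05'. "
                   "Do not pass natural language like 'yesterday'.",
    "pattern": r"^\d{4}-\d{2}-\d{2}$",
}

def is_valid_date_arg(value):
    """Reject natural-language dates before the tool ever runs."""
    return re.fullmatch(date_param["pattern"], value) is not None
```

Validating arguments at the boundary also gives you a clean place to return a corrective error message instead of a stack trace, which helps the model self-correct on the next turn.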

Step 4: Check for Model Confusion From Too Many Similar Tools

If you have multiple tools that do similar things, the model will sometimes call the wrong one or oscillate between them. I've written about this in detail in the context of the tool trap research — but the practical fix is tool consolidation and clearer disambiguation.

Signs you have this problem: the agent alternates between similar tools across turns, or consistently picks the wrong one of a near-duplicate pair.

The fix is either to merge the tools into one (with a mode parameter) or add explicit disambiguation language to each description: "Use this instead of X when Y."
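Sketching the merge option with the two tools from earlier — the consolidated name and wording are hypothetical, but the shape (one tool, a `mode` enum, a query whose meaning depends on the mode) is the pattern:

```python
# Hypothetical consolidated schema replacing the near-duplicate pair
# search_files / grep_files with a single tool and a mode parameter.
find_tool = {
    "name": "find",
    "description": (
        "Find files. With mode='name', matches file names by glob "
        "pattern (e.g. '*.json'). With mode='content', searches file "
        "contents by regex. Returns matching file paths."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "mode": {
                "type": "string",
                "enum": ["name", "content"],
                "description": "What to match: file names or file contents.",
            },
            "query": {
                "type": "string",
                "description": "Glob pattern (mode='name') or "
                               "regex (mode='content').",
            },
        },
        "required": ["mode", "query"],
    },
}
```

The enum forces the disambiguation decision into a single, explicit parameter instead of a tool-selection step the model can get wrong.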

Step 5: Check How the Agent Handles Tool Results

Sometimes the tool call works perfectly but the agent acts on the result incorrectly. This is harder to debug because the failure happens after the tool call, not during it.

Check whether the result format is too large or too ambiguous for the model to parse, and whether tool error messages give the model enough information to self-correct on the next turn.

I use WatchDog to monitor tool call success rates per tool. When one tool's success rate drops, I know where to look. Without per-tool observability, these failures look like general "agent unreliability." With it, they're diagnosable in minutes. See watch.klyve.xyz for how I set this up.
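If you don't have per-tool observability yet, the core of it fits in a few lines. This is a minimal sketch of the idea, not WatchDog itself:

```python
from collections import defaultdict

class ToolStats:
    """Track per-tool success rates so one tool's regression
    stands out instead of reading as general agent flakiness."""

    def __init__(self):
        self.calls = defaultdict(lambda: {"ok": 0, "fail": 0})

    def record(self, tool_name, success):
        self.calls[tool_name]["ok" if success else "fail"] += 1

    def success_rate(self, tool_name):
        c = self.calls[tool_name]
        total = c["ok"] + c["fail"]
        return c["ok"] / total if total else None
```

Record a result after every tool call and alert when any tool's rate drops below a threshold you choose.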

Step 6: Run a Minimal Reproduction

If you're still stuck after the steps above, strip the problem down. Create a minimal test: a single-turn conversation with only the failing tool, a simple system prompt, and a direct request that should trigger the tool call. If the minimal case works, your problem is in the broader context — other tools, a complex system prompt, or conversation history that's steering the model away from the tool.
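A minimal repro request has exactly three ingredients: one tool, a bare system prompt, one direct user message. This sketch builds that payload in the shape of the Anthropic Messages API (the model name is an assumption — substitute whatever tool-capable model you're debugging against):

```python
def minimal_repro_request(tool, prompt):
    """Build a single-turn request containing only the failing tool.

    The payload shape follows the Anthropic Messages API, but any
    tool-calling API needs the same three ingredients: one tool,
    a bare system prompt, and one direct user message.
    """
    return {
        "model": "claude-sonnet-4-5",  # assumption: any tool-capable model
        "max_tokens": 1024,
        "system": "You are a helpful assistant. Use tools when appropriate.",
        "tools": [tool],
        "messages": [{"role": "user", "content": prompt}],
    }
```

Send this and check whether the response contains a tool-use block. If it does, the tool is fine and your broader context is the problem; if it doesn't, you've isolated the tool itself.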

If the minimal case also fails, you've isolated the tool itself. At that point, try the call with a different model. Some models have stronger or weaker tool calling — Claude models tend to be more reliable for complex tool schemas than GPT models in my experience, but YMMV. If the same tool works with a different model, you have a model-specific compatibility issue.

The Debug Checklist

In order of how often each fixes the problem:

  1. Is the tool actually being passed to the model?
  2. Does the description say when NOT to use the tool?
  3. Are all parameter types and formats explicitly described?
  4. Are there similar tools causing confusion?
  5. Is the tool result format too large or too ambiguous?
  6. Does the error message give the model enough to self-correct?
  7. Does it work in a minimal, isolated test?

Ninety percent of the tool calling failures I've debugged were fixed by items 1–3. If you're past item 3 and still failing, you're dealing with something model-specific or context-specific that requires the minimal reproduction to isolate.
