I've been running this agent autonomously for months. Tool calling is the surface where most failures happen. After seeing the same patterns break in different contexts, I've developed a repeatable debug sequence. This isn't theory — every step here maps to a real failure I've seen.
Why Tool Calling Fails (The Short Version)
Tool calling failures fall into three categories:
- The model doesn't call the tool at all — it answers from memory or decides the tool isn't needed
- The model calls the tool with wrong parameters — wrong types, missing fields, hallucinated values
- The tool call succeeds but the agent ignores or misuses the result — it reads the output wrong or doesn't act on it
Most developers assume the problem is in their tool implementation. It almost never is. The tool works fine. The problem is in the description, the parameter schema, or the surrounding prompt context that the model uses to decide when and how to call it.
Step 1: Check Whether the Model Is Even Seeing the Tool
Before debugging descriptions or parameter types, verify the tool is actually being passed to the model. This sounds obvious, but I've hit it. Frameworks that load tools dynamically sometimes silently drop tools that fail validation, have duplicate names, or exceed context limits.
Log the full tools array before each API call. Confirm your tool appears. If you're passing 20+ tools, also check the total token count — large tool schemas consume context budget, and some models degrade badly when the tools block itself is more than a few thousand tokens.
Quick check: Remove all tools except the one you're debugging and test again. If it works, the problem is tool count or a conflict between tools. Add them back one by one until it breaks.
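This preflight can be automated. Here's a minimal sketch — the `audit_tools` helper and the 4-characters-per-token estimate are my own conventions, not part of any framework:

```python
import json

def audit_tools(tools, max_block_tokens=4000):
    """Sanity-check a tools array before sending it to the model.

    Returns a list of warning strings; empty means the array looks fine.
    Assumes each tool is a dict with at least a "name" key.
    """
    warnings = []
    names = [t.get("name") for t in tools]

    # Duplicate names: many frameworks silently drop one of the pair.
    seen = set()
    for name in names:
        if name in seen:
            warnings.append(f"duplicate tool name: {name}")
        seen.add(name)

    # Missing or empty names fail schema validation in most APIs.
    if None in names or "" in names:
        warnings.append("tool with missing or empty name")

    # Rough token estimate (~4 chars/token) for the whole tools block.
    approx_tokens = len(json.dumps(tools)) // 4
    if approx_tokens > max_block_tokens:
        warnings.append(
            f"tools block ~{approx_tokens} tokens; "
            f"quality can degrade past {max_block_tokens}"
        )

    return warnings
```

Run this on the exact array you pass to the API, and log the warnings alongside the request so dropped or duplicated tools show up immediately.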
Step 2: Audit the Tool Description
Tool descriptions are the model's only guide for when and how to use a tool. Most descriptions I've seen in the wild are too vague. Here's a real before/after from my own tooling:
Before (broken):
{
"name": "search_files",
"description": "Search for files."
}
After (working):
{
"name": "search_files",
"description": "Search for files by name pattern using glob syntax. Use this when you need to find files whose names match a pattern (e.g., '*.json', 'config*'). Do NOT use this to search file contents — use grep_files for that. Returns a list of matching file paths sorted by modification time."
}
The critical elements I added: what it does, when to use it, when NOT to use it, what the input syntax looks like, and what the output is. That last point — the negative case — is what fixed my failure. The model was calling search_files when it should have been calling grep_files, because nothing told it not to.
Step 3: Validate Your Parameter Schema
Parameter schema issues are the second most common cause of tool calling failures. The model will try to fill parameters based on their names and descriptions. If names are ambiguous or types are wrong, you'll get malformed calls.
Things that consistently cause problems:
- Required vs optional not specified — the model may omit "optional" parameters entirely or hallucinate values for them. Explicitly mark every parameter as required or optional.
- Overly broad types — if a parameter accepts string | number, the model will sometimes pass the wrong one. Narrow the type and explain the format in the description.
- Nested objects without descriptions — if you have a parameter that's an object with sub-fields, describe each sub-field. The model will guess otherwise.
- Enum values not listed — if a string parameter has a fixed set of valid values, list them in the description or use an enum schema. Don't assume the model knows.
Concrete example: I had a tool that took a date parameter as a string. The schema said "type": "string". The model kept passing natural language like "yesterday" or "last Monday." Once I changed the description to say "ISO 8601 date string, e.g. '2026-03-05'", the failures stopped.
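The fixed pattern looks like this, sketched as a JSON Schema fragment plus a pre-flight validator. The constant name, the regex, and the helper are illustrative, not pulled from my actual tool:

```python
import re

# Parameter schema with the format made explicit: a narrow type, a
# description with a concrete example, and a pattern the model's
# output can be checked against before the tool runs.
DATE_PARAM_SCHEMA = {
    "type": "string",
    "description": (
        "ISO 8601 date string, e.g. '2026-03-05'. "
        "Do not pass relative dates like 'yesterday'."
    ),
    "pattern": r"^\d{4}-\d{2}-\d{2}$",
}

def validate_date_param(value: str) -> bool:
    """Check a model-supplied value against the schema's pattern, so a
    malformed call fails fast with a clear error instead of deep inside
    the tool."""
    return bool(re.fullmatch(DATE_PARAM_SCHEMA["pattern"], value))
```

The description and the pattern say the same thing twice on purpose: the description steers the model before the call, and the pattern catches it when the steering fails.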
Step 4: Check for Model Confusion From Too Many Similar Tools
If you have multiple tools that do similar things, the model will sometimes call the wrong one or oscillate between them. I've written about this in detail in the context of the tool trap research — but the practical fix is tool consolidation and clearer disambiguation.
Signs you have this problem:
- The model calls tool A, gets a result, then calls tool B with the same intent
- The model seems "confused" about which tool to use and tries both
- Removing one of the similar tools makes the other work correctly
The fix is either to merge the tools into one (with a mode parameter) or add explicit disambiguation language to each description: "Use this instead of X when Y."
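A consolidated tool might look like the sketch below — a single search tool with a mode enum replacing search_files and grep_files. The exact schema shape (I'm using an Anthropic-style input_schema key here) is an assumption, not my production definition:

```python
# One tool replaces two near-duplicates; the mode enum plus per-mode
# descriptions carry the disambiguation the model needs.
SEARCH_TOOL = {
    "name": "search",
    "description": (
        "Search the project. Use mode='filename' to find files whose names "
        "match a glob pattern; use mode='content' to find lines inside "
        "files matching a regex. Returns matching paths (filename mode) "
        "or path:line:text matches (content mode)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "mode": {
                "type": "string",
                "enum": ["filename", "content"],
                "description": "What to search: file names or file contents.",
            },
            "query": {
                "type": "string",
                "description": (
                    "Glob pattern (filename mode) or regex (content mode)."
                ),
            },
        },
        "required": ["mode", "query"],
    },
}
```

With one tool, the model can't oscillate between near-duplicates; the worst case becomes a wrong mode value, which is easy to validate and reject with a specific error.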
Step 5: Check How the Agent Handles Tool Results
Sometimes the tool call works perfectly but the agent acts on the result incorrectly. This is harder to debug because the failure happens after the tool call, not during it.
Check:
- Result format: Is the output parseable by the model? If your tool returns a 2000-line JSON blob, the model may fail to extract what it needs. Return the minimum data needed for the next step.
- Error messages: If your tool returns an error, what does the error say? Generic messages like "Error: failed" don't give the model enough to recover. Specific messages like "Error: file not found at path /tmp/data.json — check if the file was created in a previous step" enable self-correction.
- Result size: Large results consume context. If the tool result pushes your context over the limit, subsequent reasoning will degrade. Truncate or summarize tool outputs when possible.
I use WatchDog to monitor tool call success rates per tool. When one tool's success rate drops, I know where to look. Without per-tool observability, these failures look like general "agent unreliability." With it, they're diagnosable in minutes. See watch.klyve.xyz for how I set this up.
Step 6: Run a Minimal Reproduction
If you're still stuck after the steps above, strip the problem down. Create a minimal test: a single-turn conversation with only the failing tool, a simple system prompt, and a direct request that should trigger the tool call. If the minimal case works, your problem is in the broader context — other tools, a complex system prompt, or conversation history that's steering the model away from the tool.
If the minimal case also fails, you've isolated the tool itself. At that point, try the call with a different model. Some models have stronger or weaker tool calling — Claude models tend to be more reliable for complex tool schemas than GPT models in my experience, but YMMV. If the same tool works with a different model, you have a model-specific compatibility issue.
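The minimal reproduction can be packaged as a tiny harness that takes the model call as a parameter, so the identical test runs against different models. Here `call_model` is a stand-in for whatever client wrapper you use, and the shape of the returned tool call is an assumption:

```python
def run_minimal_repro(call_model, tool: dict, request: str) -> dict:
    """Single-turn repro: one tool, a bare system prompt, one direct
    request that should trigger the tool.

    call_model(system, messages, tools) is your API wrapper; it should
    return a dict like {"tool_name": ..., "arguments": {...}} when the
    model calls a tool, or None when it answers in plain text.
    """
    system = "You are a helpful assistant. Use the provided tool when relevant."
    messages = [{"role": "user", "content": request}]
    call = call_model(system, messages, [tool])

    return {
        "tool_called": call is not None,
        "correct_tool": call is not None and call.get("tool_name") == tool["name"],
        "arguments": call.get("arguments") if call else None,
    }
```

Because the model call is injected, you can also point the harness at a stub during development to confirm the harness itself isn't the bug before blaming the model.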
The Debug Checklist
In order of how often each fixes the problem:
- Is the tool actually being passed to the model?
- Does the description say when NOT to use the tool?
- Are all parameter types and formats explicitly described?
- Are there similar tools causing confusion?
- Is the tool result format too large or too ambiguous?
- Does the error message give the model enough to self-correct?
- Does it work in a minimal, isolated test?
Ninety percent of the tool calling failures I've debugged were fixed by items 1–3. If you're past item 3 and still failing, you're dealing with something model-specific or context-specific that requires the minimal reproduction to isolate.