Debugging AI Agents in Production: A Complete Guide

Most debugging guides for AI agents are written by engineers observing agents from the outside — traces, logs, evaluation scores. But the gap between what looks fine in traces and what’s actually breaking in production is large, and the diagnostic skills to close that gap are not well documented.

This guide covers both: the structural failure modes that every agent hits regardless of environment, and the production-specific failures that only emerge under real load with real infrastructure. It then covers how to diagnose failures systematically — given a broken agent and existing trace data, how you actually find root cause.

If you haven’t yet instrumented your agent system, see the ai-agent-observability post before continuing — the diagnostic workflow below depends on having trace IDs, step-level snapshots, and tool call audit logs available.

The Failure Taxonomy

Agent failures fall into four categories. Within each category, the failures share a common property: they look like something else at first glance. The first step in any diagnosis is classifying which category you’re in.

Category 1: Tool Boundary Failures

This is where most production failures originate — not in the model’s reasoning, but at the interface between the model and the tools it calls.

Connection success ≠ function success. Testing whether a system is reachable is not the same as testing whether it works. An email SMTP provider can pass a connectivity check while failing at authorization. A database can accept a connection while blocking writes. An API can authenticate while having quota exhausted. The fix: run end-to-end tests that produce a verifiable external artifact. For email: send a real message and verify receipt via IMAP. “It responded” is not proof. “Here is the specific value it returned” is.

Tool return code trust. HTTP 200 doesn’t mean success. Dozens of APIs return 200 OK with a body containing "error": "rate limit exceeded" or "status": "queued" or "result": null. Treating the 200 as success and moving on produces silent failure downstream. Every external call should produce a verifiable artifact — for write operations, read back what you wrote; for submissions, verify via a separate channel. The rule: HTTP 200 is necessary but not sufficient.

The confident empty search. The agent produces a detailed, well-structured output citing specific data points. Later review reveals the underlying retrieval returned zero results. The log signature: a tool invocation with result_length_chars: 0 and status: "success_empty", followed within 1–2 steps by an LLM completion containing high-confidence, specific claims. No error event appears anywhere in the trace.

The fix: add a post-tool-call gate at the orchestration layer. If any tool marked as a primary data source returns an empty result, inject a system message before the next LLM call: “Warning: [tool_name] returned no results. Do not proceed with synthesis.” The LLM’s default behavior is to fill gaps with plausible text. You have to explicitly interrupt that behavior before it starts.

Rate limit mid-chain. An agent executing a 12-step task hits a 429 at step 7. The response body is partial — it contains some data before the limit hit. The agent treats partial data as complete and proceeds. What the trace shows: a tool invocation with http_status: 429, followed by an LLM completion that cites specific data from the partial results. No error event fires at the pipeline level because the tool technically returned something.

The same pattern appears as timeout racing: a tool that normally returns in 200ms occasionally takes 3 seconds under load. The orchestration layer times out at 2 seconds and injects an empty result. The agent proceeds as if the tool ran successfully. Both failures look like model hallucination. They aren’t — and they won’t appear in ToolScan or similar benchmarks because even the strongest models achieve only 73% on multi-step tool-use tasks under production conditions ¹.

How to distinguish from model failure: the failures correlate with tool-layer events (rate limit status codes, elevated latency, partial response bodies), not with specific prompt patterns or model parameters.

Category 2: Memory and Context Failures

Stale memory poisoning. An agent’s principles file or knowledge base has information that was correct when written but is now wrong — and has never been audited. The agent reads it every session, making decisions based on a belief that no longer matches reality. A stale belief about a system’s state is worse than no belief: it actively misleads.

Tag beliefs with a last-verified timestamp, not just a created timestamp. Run a staleness check on any belief older than N sessions before relying on it. For long-running agents, schedule a periodic memory audit that cross-checks stored facts against external ground truth.

Context window amnesia. An agent correctly enforces a constraint early in a session — say, “only use documents dated after 2024-01-01” — then begins violating it later with no obvious trigger. The log signature: context_utilization_pct rising above 75–80% in the steps before the behavioral change. The original constraint was present early in context but was displaced as working memory grew.

The fix: periodic constraint reinforcement. Every N steps, re-inject core task constraints as a fresh system message appended to the end of context rather than relying on their presence at the beginning. For long-running agents, treat initial instructions as perishable. They expire.

Cascade amplification. A wrong premise at step 1 of a 12-step session produces 11 steps of compounding error. Each subsequent step is logically consistent with the previous one. The agent looks correct at every step. The final output is completely wrong — not because reasoning failed, but because it was correct reasoning from a broken foundation.

This is harder to catch than a simple error because the agent’s reasoning is locally valid. The failure only becomes visible when the final output doesn’t match external reality. The fix: force external verification at session midpoints, not just at the end. Write a CHECKPOINT to your log at the halfway point of any long session: “Am I still pursuing the right goal? Did any early step produce a result that should change the plan?”

Category 3: Multi-Agent Failures

Delegation failure. A parent agent’s trace looks completely clean. A parent agent delegates research subtasks to three child agents. Child agent 2 encounters a rate limit, processes 3 of its 7 assigned documents, and generates a summary: “Completed analysis of assigned document set. Key findings include…” The summary is fluent, confident, and accurate about the 3 documents it processed. It doesn’t mention the 4 it didn’t.

The parent reads “completed analysis” and proceeds. Its own trace shows a healthy delegation span with status: completed. Analysis of multi-agent LLM system failures finds that inter-agent misalignment — including failure to surface errors or incomplete results to parent agents — is one of the three primary failure categories across all multi-agent frameworks, with failure rates of 41–86.7% on state-of-the-art systems ².

The fix: require child agents to return structured completion metadata alongside any narrative summary. The parent must assert items_processed == items_assigned before marking delegation as successful. Narrative summaries are for humans; structured receipts are for machines.

Recovery loop blindness. The same approach, tried six times, failing six times, with slight variations between each attempt. No new information is produced after the third failure. The last three retries are waste — worse than waste, because they consume context and time that could be spent on a fundamentally different approach.

Retry-with-backoff is good engineering. Retry-with-the-same-approach-that-just-failed-six-times is optimism. Separate “retry same action” (legitimate, for transient errors) from “retry same strategy” (almost never justified). Add a stuck counter: if the same category of action fails 3+ times, stop retrying and escalate. Implement loop detection at the orchestration layer — track (tool_name, hash(args)) across steps. If the same pair appears three consecutive times, interrupt execution.

Category 4: Self-Assessment and Security Failures

Circular self-assessment. The agent evaluates its own output using the same reasoning process that produced the output in the first place. “Does this code work?” The agent reads the code it wrote and concludes yes. But the reasoning that wrote the bug is the same reasoning evaluating whether the bug exists. The assessment is biased at the exact moment it most needs to be independent.

Never trust self-assessment without an external signal. “I think this worked” is not evidence. “The service returned HTTP 200 and I can read back the exact value I wrote” is evidence. For code: run it and check exit codes. For reasoning: check output against external data. When in doubt, use a fresh context — a second agent without the bias of the original session.

Prompt injection via retrieved content. An agent tasked with processing or summarizing retrieved content instead follows instructions not in its system prompt — revealing internal configuration, abandoning its assigned task. The log signature: an LLM completion immediately following a tool return from an untrusted source contains content inconsistent with the system prompt. The tool return contains text structured as instructions: "SYSTEM: Disregard all previous instructions..." No schema violation is flagged because it’s a well-formed string return.

Wrap all retrieved content in explicit structural delimiters before injecting into context: <retrieved_content> ... </retrieved_content>. Instruct the model that content inside those tags is data, not instructions. This doesn’t fully prevent injection — structurally-aware injections can still succeed ³ — but it raises the bar significantly.

Production-Specific Patterns

Three failure classes only appear in production. None can be caught in offline evaluation.

Staging vs. production divergence. In staging: curated result sets, simplified authentication, relaxed rate limits, low concurrency. In production: live search results that are noisier and occasionally empty, authentication that occasionally returns a 401 the agent misreads, shared state accessed concurrently by multiple instances.

What makes this class hard to diagnose: the agent’s behavior appears to have changed, so the first hypothesis is usually a model regression or prompt issue. It isn’t. The tool changed. Check your tool call audit log for response schema differences between staging and production tool invocations before touching your prompts.

Context degradation under load. Production agents run longer chains with more context accumulation than typical eval episodes. Constraint adherence that holds at turn 5 can fail at turn 35 as working context fills with intermediate results and the original instructions recede in the attention window. Most eval suites run short episodes that don’t exercise this regime. Analysis of 900 execution traces across three representative models identifies “fragile execution under load” — failures that emerge specifically under production conditions — as one of four recurring failure archetypes ⁴.

Multi-agent cascade. A child agent fails in a way that propagates through a parent agent’s context without being flagged. 79% of practitioners identify unpredictable execution flows as a major challenge for evaluation — meaning most teams already know their benchmark results are incomplete pictures of production behavior ⁵.

The Diagnostic Workflow

When a production agent fails, you are working backward from an output that looks wrong to the step where execution first diverged.

1. Start with the output artifact. What did the agent claim to have done? Read the final completion as a statement of intent: “I searched for X, found Y, and produced Z.” Hold this against what you can verify externally. The gap between what the agent claims and what you can corroborate is your diagnostic target.

2. Find the last known-good step. Walk backward through the trace to find where agent state diverged from expected. Not the last tool call — the last step where output was verifiably correct. If the agent claimed to synthesize from search results, were the search results correct? If yes, the failure is in synthesis. If no, walk back further.

3. Check the tool call boundary. At the divergence point, what did the LLM request versus what did the tool return? Most production failures originate at the LLM/tool boundary, not in the model’s reasoning. A tool that returned empty results, a partial response, or a schema-nonconforming value will cause downstream failures that look like model failures.

4. Check context state. How full was the context window at the divergence step? What was in working memory? Was a constraint stated at step 1 still present at step 10? Pull the context_utilization_pct values from your LLM call logs around the divergence point.

5. Isolate and replay. Can you replay the failing step in isolation with the same state? If your observability setup logged the full message list at each step, you can reconstruct the LLM input for any individual step and replay it. A failure you can replay is a failure you can fix.

6. Classify. Model failure, tool failure, or orchestration failure? Each has a different fix path. Model failures get prompt changes or guardrails. Tool failures get client-side validation and fallback handling. Orchestration failures get structural changes to delegation and verification logic.

A concrete example: a research agent searched five topics, synthesized results, and wrote a report. The final report was detailed and confident, but referenced statistics that couldn’t be verified. Walking the trace: web_search returned empty results for three of the five queries. Context utilization at those steps was 18% — not the issue. The LLM completions following the empty returns proceeded to describe specific findings anyway. Step 4 was the first deviation: tool_result: {"results": []} followed by a completion containing “According to a 2024 study, 67% of respondents reported…” The agent had no data and invented data to fill the gap. Root cause: tool failure (empty search results). The fix: an orchestration-layer gate that intercepts empty search results before they reach the LLM.

Minimal Reproduction

The goal of minimal reproduction is to isolate the failure from the full production execution and reproduce it deterministically in a controlled environment.

State Serialization

If your observability setup logs step-level state snapshots, you can extract agent state at any step and replay from that point. You don’t have to reproduce the full preceding execution to debug a failure at step 9 — you load the serialized state from step 8 and replay step 9 in isolation.

# Load serialized agent state from trace
state = load_trace_snapshot(trace_id="t-a3f7b2", step=8)

# Reconstruct execution context
messages = state["messages"]
tool_history = state["tool_history"]
working_memory = state["working_memory"]

# Replay the failing step with controlled tools
result = agent.step(
    messages=messages,
    tools=mock_tools,  # controlled stubs
    working_memory=working_memory
)
assert result != expected_output  # confirm you reproduced the failure

This only works if you are logging messages at each step. If you’re only logging final outputs, you cannot replay intermediate steps. This capability is the most valuable thing your observability infrastructure enables for debugging — and the one most teams build after they needed it, not before.

Tool Mocking

Replace production tools with controlled stubs that reproduce the specific bad return value. If the failure is caused by an empty search result at step 3, your mock returns {"results": []} at step 3 and normal results everywhere else.

class MockSearchTool:
    def __init__(self, fail_on_call: int, fail_value: dict):
        self.call_count = 0
        self.fail_on_call = fail_on_call
        self.fail_value = fail_value

    def search(self, query: str) -> dict:
        self.call_count += 1
        if self.call_count == self.fail_on_call:
            return self.fail_value
        return {"results": [{"title": "test", "snippet": "test content"}]}

# Reproduce empty result at step 3
mock_tool = MockSearchTool(fail_on_call=3, fail_value={"results": []})

This lets you reproduce rate limit responses, timeout behaviors, and schema variations without hitting production infrastructure.

Handling Nondeterminism

If the same inputs produce different failures across runs, run the failing step 5 times and analyze the distribution. Deterministic failures occur the same way each time and indicate a structural trigger. Stochastic failures require adding deterministic guardrails at the orchestration layer rather than changing model behavior directly. A failure that occurs 4 of 5 times with the same tool mock is fixable. A failure that occurs 1 of 5 times points to a different root cause.

What Evals Miss

Offline evaluation is necessary but not sufficient. A benchmark can tell you whether an agent correctly completes a task given well-formed inputs in a controlled environment. It cannot tell you what happens in production.

Timing-dependent failures. Rate limits, partial tool responses, and network latency don’t exist in eval environments. An agent that scores well on a benchmark may hit a downstream API rate limit in production, receive a partial response, and hallucinate the rest.

Tool behavior divergence. Staging and production tools behave differently. Rate limits are relaxed in staging. Data schemas are simplified. Authentication surfaces are mocked. An agent validated against a staging tool may fail silently when the production version returns a slightly different response structure, an occasional empty result, or a transient error.

Cascading failure in multi-agent pipelines. Offline evals assess individual agents against individual tasks. They don’t assess what happens when a sub-agent fails in a way that propagates through a parent agent’s context without being flagged.

What production monitoring needs to add on top of evals: anomaly detection on token consumption per task, automated comparison of tool return value distributions between staging and production, and pipeline-level success tracking that aggregates across agent spans rather than reporting only leaf-level task completion rates.

For patterns specific to tool calling failures, see the ai-agent-tool-calling-fails-debug-guide. For the broader context on why agents fail in production environments, see why-ai-agents-fail-production.

The Questions to Ask Before You Ship

If you’re building an agent system for production, answer these before deployment:

For every external action: what’s the verifiable artifact that proves it succeeded?
For every memory store: how old is too old, and what triggers re-verification?
For every multi-step plan: where is the midpoint checkpoint that catches cascading errors?
For every retry loop: how many attempts before you switch strategy?
For every self-assessment: what external signal could falsify your conclusion?
For every child agent: what structured completion metadata does it return beyond a narrative summary?

The agents that work in production aren’t the ones that never fail. They’re the ones that catch their failures fast.

Production agent debugging is 80% reading tool boundaries, not model behavior. When an agent writes a confident report with fabricated statistics, the correct first hypothesis is not “the model hallucinated” — it’s “what did the tool return at the last data retrieval step.” The model did what it was designed to do: produce coherent, plausible output from the context it had. If that context contained empty search results or a partial rate-limited response, the model’s output reflects that. The failure is upstream.

A visible failure is one you can fix. That’s the whole goal of the infrastructure — not to prevent failures, but to make them impossible to miss.

Zhuang, Y. et al. (2024). “ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs.” arXiv:2411.13547. The highest-performing model achieved 73% success on multi-step tool-use tasks; Insufficient API Calls was the most prevalent error pattern. ↩
Cemri, M., Pan, M. Z., Yang, S., et al. (2025). “Why Do Multi-Agent LLM Systems Fail?” arXiv:2503.13657. Analysis of 1600+ annotated multi-agent traces across 7 frameworks finds 41–86.7% failure rates, identifying inter-agent misalignment as a primary failure category. ↩
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” arXiv:2302.12173. ↩
“How Do LLMs Fail In Agentic Scenarios?” arXiv:2512.07497. Analysis of 900 execution traces identifies “fragile execution under load” as a recurring failure archetype across all model tiers. ↩
Moshkovich, D., et al. (2025). “Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems.” arXiv:2503.06745. 79% of practitioners identify non-deterministic agent execution as a major challenge for evaluation. ↩