Your agent has been running for 28 minutes. It has read a hundred documents, called external APIs nine times, written intermediate results to a database, and is halfway through synthesizing a final report. The process dies — OOM, a bad deploy, a preempted cloud instance, a network partition that disconnects the worker.
LangGraph checkpointed the state at minute 20. That checkpoint exists. Your agent’s work is not lost in the sense that state was serialized.
But nothing is going to resume it.
The checkpoint is inert data sitting in SQLite until something external reads it, decides the task is dead, constructs the correct invocation, and re-schedules it. In most production deployments, that “something external” is either a human who notices the failure, a custom watchdog you wrote, or nothing at all.
This is the gap between checkpointing and durable execution. It is not a minor implementation detail. It is a different contract about who is responsible for completion.
Two Different Contracts
Checkpointing saves workflow state so you can restart from a known point if you choose to. The checkpoint is a recovery mechanism — it hands state back to you and expects you to act on it. If the process that manages the checkpoint crashes, the checkpoint is inert until something external reads and re-schedules it. The framework’s responsibility ends at the write.
Durable execution means the runtime guarantees the workflow runs to completion. The engine, not your code, is responsible for detecting failure and resuming. The workflow is a first-class scheduled entity that the infrastructure tracks. If a worker crashes, the engine re-schedules the workflow on another worker. The guarantee lives in the infrastructure layer.
Diagrid’s engineering team put it precisely: “Checkpointing says: ‘I saved your state. You take it from here.’ Durable execution says: ‘Your agent workflows will run to completion. Period. I handle everything.’”1
The gap between those two statements is not incremental. It is architectural.
How Checkpointing Actually Works in Major Frameworks
Understanding the gap requires looking at the mechanics of what each framework actually provides.
LangGraph
LangGraph automatically saves graph state at every “superstep” — each tick where nodes execute. State is scoped to a thread_id. To resume, a caller invokes graph.invoke(None, config=config) with the same thread identifier.
This works when the caller is alive and aware that the task needs resumption. When retries are exhausted, LangGraph raises an exception. As Diagrid’s analysis documents: “There is no built-in fallback routing, no dead-letter queue, no notification system.”1
More critically: if your process crashes, no supervisor knows. There is no watchdog, no heartbeat mechanism built into the framework. The workflow is simply dead until something external notices. And if two processes try to resume the same thread_id simultaneously — for example, after a failover where the original process recovers — LangGraph has no built-in coordination to prevent double execution. You are responsible for distributed locking.
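That coordination is left entirely to the application. As a rough sketch of the kind of advisory lease you would have to build yourself before resuming a thread, here is a SQLite-backed version — the schema, function names, and TTL are illustrative, not part of LangGraph or any framework:

```python
import sqlite3
import time

# Hypothetical advisory lease over a thread_id: the distributed lock you must
# supply yourself before resuming a checkpointed thread, so that a recovered
# original process and a failover process cannot both resume the same work.
# (In-memory SQLite for illustration; production needs a shared store.)

def make_store():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE leases (thread_id TEXT PRIMARY KEY, holder TEXT, expires_at REAL)"
    )
    return db

def try_acquire(db, thread_id, holder, ttl=30.0):
    """Return True if `holder` may resume `thread_id`; False if another live holder owns it."""
    now = time.time()
    row = db.execute(
        "SELECT holder, expires_at FROM leases WHERE thread_id=?", (thread_id,)
    ).fetchone()
    if row and row[1] > now and row[0] != holder:
        return False  # someone else holds a live lease: do not double-resume
    db.execute(
        "INSERT INTO leases (thread_id, holder, expires_at) VALUES (?,?,?) "
        "ON CONFLICT(thread_id) DO UPDATE SET holder=excluded.holder, expires_at=excluded.expires_at",
        (thread_id, holder, now + ttl),
    )
    db.commit()
    return True

db = make_store()
assert try_acquire(db, "thread-42", "worker-a") is True   # first resumer wins
assert try_acquire(db, "thread-42", "worker-b") is False  # second is blocked
```

Even this toy version has sharp edges (lease expiry versus task duration, renewal while running), which is precisely the point: the hard parts of resumption live outside the framework.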
CrewAI
CrewAI’s persistence model uses a @persist decorator that saves flow state to SQLite after each successful method execution. Task replay works via crewai replay -t <task_id>. The human-in-loop pattern uses from_pending() and resume().
The limitation is the same: CrewAI has no durability story for fully autonomous ReAct agents unless each tool is modeled as a separate persisted step. Resumption requires an external caller to detect failure and trigger replay. The framework saves state; it does not guarantee completion.
Google ADK
ADK uses event sourcing — every interaction is appended as an immutable Event to session history. ResumabilityConfig(is_resumable=True) enables resumability via invocation_id. Different agent types handle resumes differently: SequentialAgent uses saved current_sub_agent; LoopAgent tracks times_looped.
The same structural problem applies. ADK’s resumability feature does not include a watchdog, heartbeat, or health check. Detecting that a workflow was interrupted is left to the caller. As Diagrid documents: “You are left to create and maintain extremely complex infrastructure to take care of hard problems that ADK’s resumability feature leaves to you.”1
What All Three Share
All three frameworks — LangGraph, CrewAI, Google ADK — share the same architectural assumption: they run in a single process, and that process is assumed to be managed externally. They save state; they do not guarantee that anything will act on it. The completion guarantee is your problem.
For short tasks (under five minutes, fewer than five tool calls, no external side effects), this is fine. For long-horizon tasks, it is a production risk that compounds with scale.
How Durable Execution Works
The frameworks that offer durable execution — Temporal, Dapr Workflow, AWS Step Functions — share a fundamentally different architecture. The execution guarantee lives in the infrastructure layer, not in your application code.
Temporal
In Temporal, workflow functions are ordinary code. The Temporal server maintains an event history for each workflow execution. Workers pull tasks, execute activities, and return results. If a worker crashes mid-execution, Temporal re-schedules the task on another available worker. The workflow function replays from the beginning on recovery, but completed activities return stored results from the event log instead of re-executing. This replay is safe because of workflow determinism: workflow code is required to make the same decisions when given the same history.
The critical design point: the Temporal server itself is the scheduler. No external watchdog is needed. No code you write is responsible for detecting failure and triggering resumption. The engine guarantees completion as long as at least one worker is available.
Temporal’s documentation describes this as “the workflow function is re-executed, but any activity that was previously completed is not re-executed.” The replay produces the same result because activities are deterministic with respect to stored history.2
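The replay mechanics can be illustrated without the Temporal SDK. In this stdlib-only sketch (the engine and all names are invented for illustration, not Temporal's API), an event log stands in for the server's history: activities whose results are already logged return the stored value instead of running again.

```python
# Toy replay engine illustrating the durable-execution property:
# completed activities are served from the event log, not re-executed.
# A conceptual sketch, not the Temporal SDK.

class ReplayEngine:
    def __init__(self):
        self.history = []    # durable event log: activity results in completion order
        self.executions = 0  # how many times real side effects actually ran

    def run(self, workflow):
        cursor = 0

        def activity(fn, *args):
            nonlocal cursor
            if cursor < len(self.history):     # completed on a prior run:
                result = self.history[cursor]  # replay the stored result
            else:
                result = fn(*args)             # first time: execute for real
                self.executions += 1
                self.history.append(result)
            cursor += 1
            return result

        return workflow(activity)

def workflow(activity):
    a = activity(lambda: 2)
    b = activity(lambda: a + 3)
    return a + b

engine = ReplayEngine()
first = engine.run(workflow)   # first run: both activities execute
second = engine.run(workflow)  # "crash recovery": replays from history
assert first == second == 7
assert engine.executions == 2  # no activity ran twice
```

The second `run` call models recovery after a crash: the function body re-executes from the top, but every side effect is answered from the log, which is why determinism of the workflow code matters.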
Dapr Workflow
Dapr Workflow uses a similar model. Before executing any workflow step, the runtime creates a durable reminder — a scheduled trigger that persists in the state store. If the process crashes, the Dapr sidecar’s reminder fires automatically and reactivates the workflow on any available worker. No external coordinator is required.
As Diagrid describes it: “Every await point in your workflow is automatically a checkpoint. No explicit save calls.”1 The developer writes linear, straightforward code. The runtime handles persistence, recovery, and coordination transparently.
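The reminder pattern can be sketched in a few lines of plain Python. This is a conceptual model, not the Dapr API; the state store and function names are invented for illustration. The key property is that the reminder is persisted before the step runs, so a crash leaves it behind for the next scan to act on.

```python
import time

# Conceptual sketch of durable-reminder recovery (not the Dapr API):
# a reminder is persisted before each step; if the process crashes,
# the reminder survives in the store and the next scan reactivates
# the workflow on whatever worker runs the scan.

state_store = {}  # stand-in for a persistent state store

def schedule_step(workflow_id, step, due_in=0.0):
    state_store[workflow_id] = {"step": step, "due_at": time.time() + due_in}

def complete(workflow_id):
    state_store.pop(workflow_id, None)  # only a finished step clears its reminder

def recovery_scan(now=None):
    """What the runtime does for you: fire every due reminder, resuming its workflow."""
    now = time.time() if now is None else now
    return [
        (wf_id, reminder["step"])
        for wf_id, reminder in list(state_store.items())
        if reminder["due_at"] <= now
    ]

schedule_step("wf-1", step=3)   # persisted before the step executes
# ... the process crashes here; nothing ever calls complete("wf-1") ...
assert recovery_scan() == [("wf-1", 3)]  # the reminder survives and fires
```

Contrast this with a checkpoint: the checkpoint is passive data, while the reminder is an active trigger that the runtime itself is obligated to fire.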
AWS Step Functions
Step Functions externalizes state entirely — the state machine definition is configuration, and execution state is stored by the service, not by any application process. Each state transition is recorded. If the worker handling a state crashes, Step Functions re-schedules the transition. The service, not the application, owns the execution guarantee.
The Common Principle
What these three systems share: the engine itself is responsible for detecting failure and resuming. The workflow is a first-class entity tracked by the infrastructure. Completion is guaranteed by the platform, not by developer-written watchdogs, external monitors, or manual intervention.
Comparison: Checkpointing vs. Durable Execution
| Dimension | Checkpointing (LangGraph, CrewAI, ADK) | Durable Execution (Temporal, Dapr, Step Functions) |
|---|---|---|
| Who guarantees resumption | Your code / external orchestrator | The engine |
| Failure detection | External (you must notice) | Built-in (engine detects) |
| Crash recovery | Manual or external watchdog | Automatic, on any available worker |
| Infrastructure complexity | Low — runs in-process | Higher — requires engine/server |
| Works well for < 5 min tasks | Yes | Overkill |
| Works well for > 30 min tasks | Fragile | Strong |
| Handles worker crashes | No | Yes |
| Prevents double execution | No (your responsibility) | Yes (engine deduplicates) |
| Developer responsibility for resumption | High | Low |
| LLM agent frameworks | LangGraph, CrewAI, Google ADK | Temporal, Dapr Workflow, Step Functions |
The infrastructure complexity column deserves attention. Durable execution systems require running an additional service — a Temporal server, a Dapr control plane, or an AWS service dependency. For internal tools, demos, and short-lived pipelines, that overhead is not justified. The tradeoff is real.
But the complexity column on the checkpointing side is systematically understated. Teams that deploy checkpointing-based agents in production inevitably build watchdogs, failure detectors, re-queue logic, dead-letter handling, and distributed locking. That complexity is just invisible — it lives in your code rather than in acknowledged infrastructure.
Failure Modes
Three failure modes surface consistently in production deployments of checkpointing-based agents. They are not edge cases.
1. The Orphaned Checkpoint
State is saved, but resumption never fires. This is the most common failure mode in production.
The checkpoint exists at t=20. At t=25, the process dies. The checkpoint is in SQLite, in Redis, or in a database. But the scheduler that would have re-queued the task also died. Or the external monitor that checks for stalled tasks has a 10-minute polling interval and the task was removed from the active queue before the monitor ran. Or the task was re-queued but the ID didn’t match the format the checkpoint reader expects.
The agent is “checkpointed” in the sense that state was serialized. It is also gone in the sense that no one will ever resume it. The work is effectively lost. The orphaned checkpoint is an artifact.
This failure is silent. No exception is raised. No alert fires. The task simply never completes. In production systems handling orders, infrastructure operations, or customer requests, silent non-completion is often worse than a loud failure.
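Detecting this failure mode is exactly the watchdog that checkpointing frameworks leave to you. A minimal sketch, assuming workers emit periodic heartbeats while a task runs; all names and the timeout are illustrative:

```python
import time

# Sketch of the watchdog checkpointing frameworks leave to you:
# workers heartbeat while a task runs; the watchdog flags any task
# whose heartbeat has gone stale so it can be re-queued.

HEARTBEAT_TIMEOUT = 60.0  # seconds of silence before a task is presumed dead

tasks = {}  # task_id -> {"last_heartbeat": float, "status": str}

def heartbeat(task_id, now=None):
    now = time.time() if now is None else now
    tasks.setdefault(task_id, {"status": "running"})["last_heartbeat"] = now

def find_orphans(now=None):
    """Tasks that checkpointed state but whose worker silently died."""
    now = time.time() if now is None else now
    return [
        tid for tid, t in tasks.items()
        if t["status"] == "running" and now - t["last_heartbeat"] > HEARTBEAT_TIMEOUT
    ]

heartbeat("task-a", now=1000.0)
heartbeat("task-b", now=1000.0)
heartbeat("task-b", now=1065.0)                # task-b's worker is still alive
assert find_orphans(now=1070.0) == ["task-a"]  # task-a went silent: re-queue it
```

Every parameter here is a real operational decision: too short a timeout re-queues live tasks (inviting double execution), too long a timeout extends the silent-loss window. Durable execution engines make these decisions once, in the platform.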
Research on multi-agent systems confirms that failures in long-horizon tasks compound in ways that are difficult to detect. A study of multi-task agent environments found that “incorrect or partially correct output at one step can propagate or even amplify through subsequent stages, compounding the impact on the final output” — and that systematic fault detection is necessary to catch these failures before they become silent losses.3
2. Double Execution on Retry
The checkpoint is re-read and the task re-runs from t=20. But a side effect from t=22 — an API call that placed an order, a database write that updated account state, a webhook that triggered a downstream service — already executed after the t=20 checkpoint and before the crash at t=25.
When the task resumes from t=20, it executes that side effect again. The order is placed twice. The database write creates a duplicate. The downstream webhook fires again.
This failure is not hypothetical. It is a standard distributed systems problem: idempotency. But most agent framework documentation treats tool calls as atomic and assumes that resumed workflows will be clean. They will not be, unless you have built explicit idempotency into every tool call the agent makes — which means every external API you call, every database write, every webhook invocation must be idempotent. In practice, few teams audit every tool call for this property.
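A minimal sketch of the idempotency wrapper such a tool call needs, using a hypothetical order-placement tool. The essential design choice is that the key is derived from the workflow's identity and step, not from wall-clock time, so it is stable across a crash-and-resume:

```python
# Sketch of an idempotency wrapper for an externally-visible tool call.
# The tool and its names are hypothetical; the pattern is the point.

class IdempotentClient:
    def __init__(self):
        self.seen = {}        # idempotency_key -> stored result (persist this in production)
        self.side_effects = 0

    def place_order(self, key, item):
        if key in self.seen:       # a resumed workflow replaying a completed step:
            return self.seen[key]  # return the recorded result, place no second order
        self.side_effects += 1     # the real external call happens exactly once
        result = {"order_id": f"ord-{key}", "item": item}
        self.seen[key] = result
        return result

client = IdempotentClient()
key = "wf-42/step-7"  # stable across crash and resume: workflow id + step index
first = client.place_order(key, "widget")
second = client.place_order(key, "widget")  # the resume re-executes the step
assert first == second
assert client.side_effects == 1             # the order was placed exactly once
```

Note that this is effectively a hand-built, per-tool version of the event-log replay that durable execution engines provide globally.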
Durable execution engines handle this through event sourcing: on replay, completed activities return stored results rather than re-executing. The Sherlock paper on agentic workflow reliability demonstrates that selective checkpoint-and-rollback with verification can reduce error propagation — but this requires explicit verification logic, not just state serialization.4
3. Checkpoint Drift
State saved at t=20 is stale by the time it is re-read at t=45.
The agent was researching pricing data. Between t=20 and t=45, prices changed. Or the agent was processing a document that has since been updated. Or an external API returned different results because the underlying data changed. The checkpoint contains the agent’s reasoning at t=20 based on a world state that no longer exists.
Resuming from the checkpoint does not resume the task — it resumes a task working on stale inputs, which may produce incorrect or incoherent results.
Checkpoint drift is particularly acute in LLM agents because their state includes not just data but reasoning context: intermediate conclusions, plans, and beliefs about the world. A long-horizon research agent that saves its belief state at t=20 and resumes at t=45 may be working from outdated premises without knowing it.
This failure mode has no clean mitigation at the checkpointing layer. It requires application-level validation that external state is still consistent with checkpoint state — validation that most frameworks do not provide and most developers do not implement.
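The application-level validation described above can be sketched simply: record a fingerprint of the external inputs alongside the checkpoint, and refuse to resume blindly when they have changed. All names here are invented; no framework provides this check for you.

```python
import hashlib
import json

# Sketch of a drift check: fingerprint the external inputs the agent's
# reasoning depends on at checkpoint time, and compare before resuming.

def fingerprint(inputs: dict) -> str:
    """Stable hash of the external state a checkpoint was computed from."""
    return hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()

def safe_to_resume(checkpoint: dict, current_inputs: dict) -> bool:
    """True only if the inputs underlying the checkpoint are unchanged."""
    return checkpoint["input_fingerprint"] == fingerprint(current_inputs)

inputs_at_t20 = {"price_usd": 99, "doc_version": 4}
checkpoint = {"state": "...", "input_fingerprint": fingerprint(inputs_at_t20)}

assert safe_to_resume(checkpoint, {"price_usd": 99, "doc_version": 4})       # world unchanged
assert not safe_to_resume(checkpoint, {"price_usd": 105, "doc_version": 4})  # prices drifted
```

A failed check does not tell you how to repair the stale reasoning, only that resuming as-is would be unsound; what to do next (re-plan, re-fetch, restart) remains an application decision.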
The Production Reality
Long-horizon agent tasks — tasks running more than ten minutes, making more than five tool calls, touching more than one external side effect — are not edge cases in modern agentic deployments. They are the use cases that make agents valuable. Automated research pipelines, infrastructure management agents, multi-step customer-service workflows, code generation and review agents: these are the tasks agents are deployed for.
For these tasks, checkpointing alone is insufficient in production.
The claim here is specific. Checkpointing is not broken. For tasks under five minutes with no external side effects, checkpointing is appropriate and durable execution is overkill. For demos and internal tools, checkpointing is fine. The SQLite-backed LangGraph checkpoint is a reasonable choice for building and testing.
The claim is that deploying checkpointing-based agents for long-horizon tasks in production — where failure means a lost order, a missed customer interaction, a broken infrastructure state — requires you to build the missing half of the durability contract yourself. And teams routinely underestimate how much that missing half costs.
A CorpGen study of autonomous agents managing long-horizon tasks across concurrent workstreams documented failure modes that map directly to these patterns: context saturation leading to silent task drops, memory interference causing corrupted state, and reprioritization failures where re-queued tasks were never completed.5 The researchers noted that handling these failure modes required explicit architectural mechanisms — they could not be solved by retry policies alone.
Where the Line Is
For long-horizon agent tasks, the threshold for requiring durable execution over checkpointing is approximately:
- Duration > 10 minutes: at this scale, worker crashes are a realistic probability, not a theoretical concern
- More than 5 tool calls: each tool call is a potential failure point; the cumulative failure probability compounds
- More than 1 external side effect: any side effect that cannot be safely re-executed requires idempotency guarantees that checkpointing does not provide
- Concurrent user workflows: at scale, the probability of simultaneous failure-and-resume on the same thread makes distributed locking mandatory
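The thresholds above condense into a single explicit check. The cutoffs are this article's rules of thumb, not universal constants; tune them to what a lost task costs you:

```python
# The article's rule-of-thumb thresholds as an explicit decision function.
# Crossing any one of them is the signal to move the completion guarantee
# into the infrastructure layer.

def needs_durable_execution(duration_min: float, tool_calls: int,
                            side_effects: int, concurrent_workflows: bool) -> bool:
    return (
        duration_min > 10        # worker crashes become a realistic probability
        or tool_calls > 5        # cumulative failure probability compounds
        or side_effects > 1      # re-execution safety requires idempotency
        or concurrent_workflows  # simultaneous resume requires distributed locking
    )

# A 3-minute, 2-tool, read-only agent: checkpointing is enough.
assert not needs_durable_execution(3, 2, 0, False)
# A 40-minute research agent with database writes and a webhook: it is not.
assert needs_durable_execution(40, 12, 2, False)
```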
Below this threshold, checkpointing is a reasonable choice. Above it, you are either using durable execution infrastructure or you are building it yourself, informally, inside your application code — usually without the operational tooling to see where it breaks.
LangGraph, CrewAI, and Google ADK are not on a trajectory toward durable execution. Diagrid’s team makes this explicit: they “would need to fundamentally rearchitect their runtimes to provide it. Adding a better checkpointer or fancier retry policy doesn’t close the gap. The gap is between saving state and guaranteeing completion.”1
This is not a criticism of those frameworks. They are appropriate for a large class of tasks. The problem is when they are deployed beyond that class — when the checkpoint-saving capability is treated as equivalent to a completion guarantee. It is not.
What This Means for Your Stack
If you are running long-horizon agents in production today using LangGraph or CrewAI, you have three options:
Option 1: Constrain your tasks. Design agent workflows to complete in under five minutes with no external side effects. This is often achievable and is frequently the right answer. Not every agent needs to run for an hour.
Option 2: Build the missing half. Implement a watchdog that monitors task health, a scheduler that detects orphaned tasks and re-queues them, and idempotency at every tool call. This is substantial engineering work, but it is what production checkpointing-based agents require.
Option 3: Move the completion guarantee to the infrastructure layer. Use Temporal, Dapr Workflow, or AWS Step Functions for workflows where the completion guarantee matters. This adds operational complexity but removes the application-level reliability engineering burden.
The choice depends on your task profile, your team, and what failure costs. But the choice should be made explicitly, with full awareness that checkpointing and durable execution are not equivalent.
What the checkpoint saves is not a guarantee that your agent’s work will be completed. It is a snapshot that something else must act on. That something else is you.
References
1. Diagrid. “Checkpoints Are Not Durable Execution: Why LangGraph, CrewAI, Google ADK, and Others Fall Short for Production Agent Workflows.” Diagrid Blog, 2025. https://www.diagrid.io/blog/checkpoints-are-not-durable-execution-why-langgraph-crewai-google-adk-and-others-fall-short-for-production-agent-workflows
2. Temporal Technologies. “Workflow Determinism.” Temporal Documentation, 2025. https://docs.temporal.io/workflows#deterministic-constraints
3. Weng et al. “Sherlock: Reliable and Efficient Agentic Workflow Execution.” arXiv:2511.00330, 2025. https://arxiv.org/abs/2511.00330
4. Weng et al. “Sherlock.” arXiv:2511.00330. The paper reports an 18.3% accuracy improvement over non-verifying baselines and a 48.7% latency reduction via speculative execution with checkpoint-and-rollback, demonstrating that fault-aware verification at the workflow level outperforms retry-only approaches.
5. Schiavone et al. “CorpGen: Simulating Corporate Environments Using Autonomous Digital Agents for Multi-Task Long-Horizon Execution.” arXiv:2602.14229, 2026. The study documents four failure modes in long-horizon multi-agent task environments — context saturation, memory interference, dependency complexity, and reprioritization overhead — and reports that architectural interventions (hierarchical planning, sub-agent isolation, tiered memory) produce up to 3.5x improvement over baselines (15.2% vs 4.3% task completion), confirming that retry policies alone are insufficient for long-horizon reliability. https://arxiv.org/abs/2602.14229