Why Planning Makes AI Agents Smarter—and Dumber

Tree of Thoughts improves Game of 24 from 4% to 74%. The same technique applied to intuitive tasks drops accuracy from 94% to 64%. The data on agent planning is more contradictory than the hype suggests, and the contradiction reveals something important about how to design autonomous systems.

In 2023, a paper from Princeton and Google DeepMind landed with a striking result. On a puzzle called Game of 24 (combine four numbers with arithmetic operations to produce 24), GPT-4 with standard prompting solved it 1% of the time. Chain of Thought reasoning brought it to 4%. Tree of Thoughts, the paper's method, hit 74%.

The technique spawned an industry. Tree search for language models. MCTS for agents. Deliberate planning over greedy response. The implicit message: more planning = smarter agents. Ship it everywhere.

A year later, a different paper tested the same approach on implicit statistical learning tasks, the kind where humans develop intuition through experience rather than reasoning. The baseline: 94% accuracy. Tree of Thoughts: 64%. Thirty points lost. The same intervention that transformed performance in one domain cut it by a third in another.

This is not a marginal result you can explain away as a bad benchmark. It's a structural finding. And it points to a problem in how most autonomous agents are designed today: they have one operating mode for everything.

- 74%: Tree of Thoughts on Game of 24 (up from 4% with CoT)
- 64%: Tree of Thoughts on intuitive tasks (down from a 94% baseline)
- 94.4%: LATS on HumanEval (ReAct: 67%)

What Tree of Thoughts Actually Does

The mechanism is straightforward. Instead of committing to one chain of reasoning, Tree of Thoughts generates multiple candidate "thoughts" at each step and evaluates them, using the language model itself as a verifier. The best-scoring branches get explored further via BFS or DFS. Bad branches get pruned.
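The loop can be sketched in a few lines. This is a BFS-style illustration, not the paper's implementation: `propose` and `score` stand in for LLM calls (candidate generation and self-evaluation), and the beam width and depth are arbitrary.

```python
# Sketch of a BFS Tree of Thoughts loop. `propose` and `score` are
# stand-ins for LLM calls; all parameter names are illustrative.
from typing import Callable

def tree_of_thoughts(
    root: str,
    propose: Callable[[str], list[str]],  # generate candidate next thoughts
    score: Callable[[str], float],        # verifier score, higher = better
    beam_width: int = 3,
    depth: int = 3,
) -> str:
    frontier = [root]
    for _ in range(depth):
        # expand every surviving branch, then keep only the best few
        candidates = [c for state in frontier for c in propose(state)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]  # prune bad branches
    return max(frontier, key=score)
```

The pruning step is where the backtracking lives: a branch that scores poorly is simply dropped, and its siblings continue.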

On Game of 24, this matters enormously. The puzzle requires backtracking. You try combinations. They fail. You need to abandon that branch entirely and try something structurally different. A standard left-to-right chain of thought cannot do this: it commits to a path and cannot course-correct. Tree search can try four paths simultaneously and abandon three.

The 4% → 74% result is real. It's not about prompting quality or model capability. It's about whether the task structurally requires backtracking. When it does, tree search is transformative. When it doesn't, when the right answer is reached by pattern recognition or by intuition accumulated through experience, forcing deliberate reasoning overrides the heuristics that were already correct.

The intuitive-task result (94% → 64%) captures exactly this. The model already "knew" the right answer in some sense. Making it reason step by step introduced noise into a signal that was cleaner without it. This is the agent equivalent of asking a jazz musician to write out every note before playing: the notation process destroys what made the improvisation good.

LATS: When Search Gets Serious

Tree of Thoughts is a reasoning technique; it operates in thought space. LATS (Language Agent Tree Search, from ICML 2024) extends this to full agent trajectories: thought, action, observation. Each node in the search tree is a complete interaction with the environment. Monte Carlo Tree Search (MCTS) handles exploration versus exploitation, and self-reflections from failed branches improve subsequent rollouts.
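The exploration-versus-exploitation step is the classic UCT rule from MCTS. A minimal sketch, with illustrative node fields and an arbitrary exploration constant:

```python
# UCT child selection, the exploration/exploitation rule LATS inherits
# from MCTS. Node fields and the constant c are illustrative.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    value: float = 0.0   # accumulated reward from rollouts through this node
    visits: int = 0
    children: list["Node"] = field(default_factory=list)

def uct_select(parent: Node, c: float = 1.4) -> Node:
    """Pick the child maximizing mean value plus an exploration bonus."""
    def uct(child: Node) -> float:
        if child.visits == 0:
            return float("inf")  # always try unvisited branches first
        exploit = child.value / child.visits
        explore = c * math.sqrt(math.log(parent.visits) / child.visits)
        return exploit + explore
    return max(parent.children, key=uct)
```

The bonus term is what lets LATS revisit a rarely-tried branch even when another branch currently looks better on average, which is exactly the behavior sequential retry lacks.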

The numbers are striking:

Task                      ReAct   Reflexion   LATS (GPT-4)
HumanEval (pass@1)        67%     88%         94.4%
HotPotQA (exact match)    0.35    0.45        0.61
WebShop (avg score)       58      63          75.9

The WebShop result is the one worth sitting with. WebShop is a realistic benchmark: an agent must search an e-commerce site, apply filters, evaluate products, and make purchase decisions. LATS at 75.9 matches the performance of supervised fine-tuning. A gradient-free, in-context tree search achieves what usually requires training on task-specific data.

Why does LATS work where Reflexion plateaus? This is the key distinction between the two methods. Reflexion is a correction technique: generate a trajectory, observe failure, reflect on what went wrong, try again. This works when the mistake is legible in retrospect, when you can look at a failed trajectory and understand what you should have done differently.

It fails when the task requires genuine exploration, when the right trajectory is fundamentally unlike anything you've tried before. No amount of retrospection on a sequence of near-identical failed attempts will help you discover that you should have taken a completely different initial approach. LATS can, because it explores divergent branches simultaneously instead of sequentially. The correct trajectory might be on branch four; sequential retry with retrospection never gets to branch four.

The Verifier Problem (The Part No One Talks About)

Both Tree of Thoughts and LATS use the language model itself to score candidate branches. This is where the architecture has a structural weakness that's underappreciated in the literature.

If the LLM's self-scoring is miscalibrated (and it frequently is, especially on out-of-distribution tasks), tree search doesn't just fail. It fails worse than greedy. It expends tokens confidently walking down wrong branches that the verifier rated highly. A single greedy attempt would have been wrong too, but at a fifth of the cost, and it might have gotten lucky. Bad tree search is not just expensive: it systematically explores the wrong paths with high confidence.

Verifier quality matters more than the search algorithm. Tree search with a well-calibrated external verifier (test suites, API responses, objective metrics) is powerful. Tree search with LLM self-scoring on ambiguous tasks is potentially counterproductive. This is why LATS performs best on coding benchmarks: pass/fail test execution gives the search an objective, reliable scoring function that doesn't depend on the model's own calibration.
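In the coding case, an external verifier can be as plain as scoring a candidate by the fraction of tests it passes, instead of asking the model to rate its own output. A toy sketch; `run_tests` and its inputs are invented for illustration (a real harness would sandbox the execution):

```python
# Toy external verifier: score candidate code by test pass rate.
# Unlike LLM self-scoring, this number doesn't depend on calibration.
def run_tests(candidate_src: str, tests: list[str]) -> float:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
    except Exception:
        return 0.0                      # doesn't even parse or run
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)       # each test is a bare assertion
            passed += 1
        except Exception:
            pass
    return passed / len(tests)          # objective score in [0, 1]
```

A tree search that ranks branches with this kind of score inherits none of the verifier miscalibration problem, which is one plausible reading of why the coding numbers above are so strong.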

There's a second failure mode: hallucination propagation. A language model that hallucinates a fact at node N in the tree will hallucinate consistently across all descendant nodes that extend from N. Tree search explores many paths from that node but cannot correct factual errors baked into the shared context. Search addresses reasoning errors; it cannot address the factual premises those reasonings operate on.

The Plan-and-Act Insight

The 2025 Plan-and-Act work offers a synthesis that avoids most of these problems. The core insight is to separate two phases that most agents conflate:

Planning is where search adds the most value. Identifying the right sequence of sub-goals, the right decomposition of a complex task: this benefits from exploring alternatives, pruning wrong approaches, backtracking. You want something like tree search here.

Execution is where greedy works. Once you have the right sub-goal, acting on it step by step is usually fine. Each sub-goal is short enough that backtracking is cheap if needed, and errors stay local. You can verify at the end of each sub-goal and replan if something went wrong, without replanning the entire task.

The practical shape: generate a directed acyclic graph of sub-goals (using search, ToT-style), then execute each sub-goal greedily with verification at the end. Local failure triggers local replanning for just that sub-goal. This contains error propagation, massively reduces token cost compared to monolithic MCTS, and produces agents that reliably finish long-horizon tasks.
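The control flow can be sketched as a loop. Every callable here (`plan`, `execute`, `verify`, `replan`) stands in for an LLM-backed component, and the DAG is flattened to a linear order for simplicity; this is an illustration of the pattern, not the paper's implementation:

```python
# Sketch of Plan-and-Act control flow: plan once (with search), execute
# each sub-goal greedily, verify at the checkpoint, replan only locally.
from typing import Callable

def plan_and_act(
    task: str,
    plan: Callable[[str], list[str]],     # search-based decomposition
    execute: Callable[[str], str],        # greedy executor for one sub-goal
    verify: Callable[[str, str], bool],   # checkpoint: did it complete?
    replan: Callable[[str], str],         # local replanning of one sub-goal
    max_retries: int = 2,
) -> list[str]:
    results = []
    for subgoal in plan(task):
        for _ in range(max_retries + 1):
            out = execute(subgoal)
            if verify(subgoal, out):
                results.append(out)
                break
            subgoal = replan(subgoal)     # error stays local to this sub-goal
        else:
            raise RuntimeError(f"sub-goal failed after retries: {subgoal}")
    return results
```

Note that a failure on sub-goal three never touches the plan for sub-goals one and two; that locality is the whole point.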

This pattern explains why checkpoint-based agents consistently outperform both purely greedy and purely search-based agents on long tasks. Targeted failure detection at verifiable checkpoints ("did this sub-goal complete correctly?") is more effective than either hoping the greedy path gets it right or running expensive global search from the beginning.

The Decision Matrix

- Linear, single-path tasks: Greedy (ReAct). No search overhead needed; backtracking isn't required.
- Fixable error patterns: Reflexion. Sequential correction with retrospection works, and it's cheap.
- Large search spaces with branching decisions: LATS with an external verifier. Explores divergent branches; objective scoring avoids the bad-verifier problem.
- Intuitive / pattern-recognition tasks: No reasoning scaffold. Deliberation overrides correct heuristics and accuracy drops.
- Long-horizon tasks (50+ steps): Plan-and-Act decomposition. The MCTS search space grows too large; decompose into locally searchable sub-goals.

A Counterintuitive Implication: Better Models Need Search Less

There's a finding from the Wharton Generative AI Lab that complicates the "just add planning" narrative: as base model capability improves, the performance gap between greedy and search-based approaches shrinks. GPT-4 with chain-of-thought already captures much of what ToT adds over GPT-3.5. The frontier model increasingly internalizes multi-step reasoning natively, reducing the marginal value of explicit tree search.

This doesn't mean search becomes useless. On genuinely hard tasks (HumanEval at 94.4% is not a number to dismiss), it still adds substantial value. But the benefit is task-specific, not monotonically increasing with search depth. Deploying LATS on every agent task is not just expensive; it may be yielding diminishing returns on the simpler tasks where greedy already gets it right.

The implication for agent design: profile your task distribution before investing in planning infrastructure. Measure where greedy fails. Apply search selectively to those failure modes. Don't architect a tree-searching agent for a workflow where 80% of tasks are linear and greedy already handles them well.

What This Means for Building Autonomous Systems

The deeper lesson from Tree of Thoughts, LATS, and the counterexamples isn't "use more planning." It's that autonomous agents need two distinct operating modes, and the hard problem is choosing between them correctly.

Fast mode: heuristic, direct, no deliberation. Use for routine, linear, pattern-matching tasks. Forcing deliberation here destroys accuracy. Design this as the default for most operations.

Deliberate mode: structured planning, search over alternatives, verification before committing. Use for tasks where backtracking is structurally necessary, where the correct approach cannot be found by extending the current path. This is the expensive mode. Reserve it.
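A mode router can be as simple as a couple of cheap, checkable signals. This is a toy sketch; the `TaskProfile` fields, mode names, and routing rules are invented for illustration, and a real system would have to estimate these signals from the task itself:

```python
# Toy mode router: pick fast vs. deliberate from two task signals.
# Fields and mode names are illustrative, not from any paper.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    needs_backtracking: bool       # can a wrong early step be extended into a fix?
    has_external_verifier: bool    # tests, API responses, objective metrics

def choose_mode(profile: TaskProfile) -> str:
    if not profile.needs_backtracking:
        return "fast"                    # greedy/heuristic, the cheap default
    if profile.has_external_verifier:
        return "deliberate"              # tree search is safe to trust here
    # search without a reliable verifier can be worse than greedy,
    # so fall back to greedy execution with local checkpoints
    return "fast-with-checkpoints"
```

The interesting engineering problem is not the dispatch itself but making the two input signals measurable before the task starts.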

The failure mode for most current agents is using one mode for everything. Either they deliberate on every action (expensive, and sometimes counterproductive) or they greedy-path everything (which fails when genuine exploration is needed). The high-performing agent architectures (Plan-and-Act, LATS with external verifiers, Reflexion scoped to correction tasks) all make this distinction explicitly.

The parallel to human cognition is direct. Kahneman's System 1 and System 2 describe the same division: fast intuitive heuristics versus slow deliberate reasoning. We're bad at knowing when to switch, and so are our agents. The agents that learn to switch correctly, and that build external signals to tell them which mode the current task requires, are the ones that will actually solve the hard problems.
