Also known as: ToT, tree of thoughts, search-augmented reasoning
TL;DR
Tree-of-thought (ToT) generalizes chain-of-thought by exploring a search tree of reasoning paths instead of a single linear chain. The model branches into multiple candidate next steps, evaluates them, and backtracks when a branch goes wrong.
Tree-of-thought (ToT) is the generalization of chain-of-thought from a single linear reasoning chain to an explicit search tree. Where CoT commits to one trajectory and runs to the end, ToT lets the model branch into multiple candidate next steps, score each branch, expand the promising ones, and backtrack from dead ends. It’s classical state-space search where each node is a “thought” produced by the LLM .
The basic loop
A ToT system has four moving parts:
State representation. Each node is a partial reasoning trace — a sequence of thoughts produced so far.
Expansion. Given a state, the model proposes K candidate next thoughts (sampled with non-zero temperature).
Evaluation. The model (or a separate evaluator) scores each candidate — sometimes as a value heuristic, sometimes via voting across multiple judgments.
Search policy. BFS, DFS, or beam search over the frontier, expanding the highest-scored nodes and pruning the rest.
When a branch reaches a terminal state (a final answer, or a verifiable goal), the search returns. When it reaches a contradiction or low-value state, it backtracks.
Why it beats linear CoT on hard problems
Linear CoT is greedy — once the model commits to a wrong intermediate step, it usually elaborates from there, accumulating error rather than recovering. ToT’s branching makes wrong steps cheap: the search tries multiple alternatives at each decision and the bad ones get pruned. On problems with hard, verifiable subgoals (the canonical example is the Game of 24), ToT can solve cases that CoT — even with self-consistency over many samples — never gets right.
The evaluator is the heart of ToT — a bad evaluator turns the search into expensive random walking. The original ToT paper used the same LLM in two modes: as a value heuristic (“rate this state from 1-10 for likelihood of leading to a solution”) and as a vote across multiple sampled judgments.
Three practical evaluator patterns:
Verifier-style. The evaluator is a small classifier trained to predict whether a partial trace leads to a correct answer. Cheaper than LLM-as-judge, more accurate when training data is available.
LLM self-evaluation. The same LLM scores its own branches with a separate prompt. Cheap to set up; correlates poorly with truth on hard problems (the model is just as confused on evaluation as on generation).
External verifier. A symbolic checker (math equation solver, code interpreter, theorem prover) evaluates intermediate states deterministically. The strongest signal when available, which is why ToT shines on Game of 24 and code synthesis.
When the evaluator is weak, search depth hurts more than it helps — bad scores send the search down dead ends faster than CoT would have committed to one chain.
The cost
CoT does ~1 forward pass per token of one chain; self-consistency does N chains; ToT can easily do 50-200x the forward passes of a single CoT chain — branching factor K at each of D depth levels times the evaluator calls. That rules it out for production hot paths; ToT lives in research and high-stakes single-shot tasks where the answer justifies seconds-to-minutes of inference.
Production-relevant variants
In practice, “tree-of-thought” rarely shows up by name in production stacks. What does show up:
ToT relatives in the wild
Beam search over CoT — keep the top-B partial chains, drop the rest. A simplified ToT.
Search-augmented agents — agent loops that branch on tool-call decisions, with the agent acting as the search controller.
Verifier-guided generation — generate K candidates with CoT, score with a learned verifier, return the best. The cheapest practical approximation of ToT’s evaluator step.
Reasoning-model deliberation — frontier reasoning models (o-series, Claude’s extended thinking) internally do something resembling ToT during their hidden deliberation budget.
MCTS over reasoning — research-stage systems that run Monte Carlo tree search over reasoning steps, with rollouts as the value estimate.
Once you accept that test-time compute buys quality, ToT is the structured way to spend it. Self-consistency is the budget option; ToT is the deluxe one. Most production systems live closer to self-consistency, with ToT-style search reserved for problems where a single answer is worth seconds of LLM time.
Go further
How is tree-of-thought different from self-consistency?
Self-consistency samples N independent linear chains and votes on the final answer — no interaction between chains. ToT explicitly builds a tree: each node is a partial reasoning state, the model expands the most promising frontier nodes, scores them, and prunes dead ends. ToT is more expensive but handles problems where you need to recover from a wrong intermediate step, which self-consistency can't do within a single chain.
Combinatorial puzzles with verifiable intermediate states (Game of 24, Sudoku, crosswords), planning problems with subgoal decomposition, and multi-step proofs where wrong early steps poison the rest of the chain. On tasks where chain-of-thought is already 90%+ accurate, ToT just burns tokens. The original ToT paper showed dramatic gains specifically on tasks where standard CoT was below 10%.
Conceptually similar but operationally different. Agent search (typified by ReAct loops and modern coding agents) interleaves reasoning with tool calls and external observations; ToT searches purely over the model's own reasoning tokens. Agents can verify steps against the world; ToT verifies steps against the model's own evaluator. The two compose — many production agents use ToT-style search inside an outer agent loop.