Chain-of-Thought

Also known as: CoT, step-by-step prompting, reasoning prompt

TL;DR

Chain-of-thought (CoT) prompting asks the model to produce intermediate reasoning steps before its final answer. The intermediate tokens act as a scratchpad.

Chain-of-thought (CoT) is the prompting technique of asking a model to produce intermediate reasoning before its final answer. The canonical trigger is appending “Let’s think step by step” to the prompt; the canonical effect is that accuracy on math, logic, and multi-step reasoning tasks jumps — sometimes by tens of percentage points.
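As a minimal sketch of the trigger described above, the snippet below builds a direct prompt and a zero-shot CoT prompt for the same question. The `Q:`/`A:` layout and the function names are illustrative assumptions, not a format any particular model requires.

```python
# Trigger phrase quoted in the text above.
TRIGGER = "Let's think step by step."


def direct_prompt(question: str) -> str:
    """Ask for the answer with no room for intermediate reasoning."""
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """Append the trigger so the model writes its reasoning before the answer."""
    return f"Q: {question}\nA: {TRIGGER}"


if __name__ == "__main__":
    print(direct_prompt("What is 17 × 23?"))
    print(zero_shot_cot_prompt("What is 17 × 23?"))
```

The only difference between the two prompts is the trailing trigger phrase, which is what makes zero-shot CoT cheap to try.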

The mechanism

Each token an LLM generates costs one forward pass, so roughly the same fixed amount of compute. A direct answer (“the answer is 391”) forces the model to do the entire reasoning silently inside the forward passes for those few answer tokens. By instead generating a reasoning chain, such as “17 × 23 = 17 × 20 + 17 × 3 = 340 + 51 = 391”, the model gets to:

Spend more total compute (more tokens = more forward passes).
Decompose the problem into easier next-token predictions, each conditioned on the previous step.
Recover from local mistakes by re-reading its own intermediate state.

The technique exploits the autoregressive structure of the model. It’s not “reasoning” in any cognitive sense — it’s compute amortization plus error correction.
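The decomposition in the 17 × 23 example can be sketched in a few lines. This is only an illustration of how one hard step becomes several easy ones; it assumes a non-negative multiplier split into tens and ones, and the function name is hypothetical.

```python
def multiplication_chain(a: int, b: int) -> str:
    """Write a * b as a chain of easy steps by splitting b into tens and ones.

    Assumes b is a non-negative integer.
    """
    tens_digit, ones = divmod(b, 10)
    tens = tens_digit * 10
    part_tens = a * tens       # easy step 1: multiply by a round number
    part_ones = a * ones       # easy step 2: multiply by a single digit
    total = part_tens + part_ones  # easy step 3: add the partial products
    return (
        f"{a} × {b} = {a} × {tens} + {a} × {ones} "
        f"= {part_tens} + {part_ones} = {total}"
    )


if __name__ == "__main__":
    print(multiplication_chain(17, 23))
    # 17 × 23 = 17 × 20 + 17 × 3 = 340 + 51 = 391
```

Each intermediate value in the string is something a later step can condition on, which is the point of writing the chain out.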

Where it helps disproportionately

Multi-step math. The hallmark CoT use case. Going from “predict the final number” to “produce arithmetic chain, then final number” turns a hard task into a manageable one.
Logic puzzles, planning, multi-hop QA. Anything where the answer depends on chaining several facts.
Code reasoning. “What does this function return given these inputs?” benefits from stepping through the code.
Long-context analysis. Asking the model to enumerate evidence before drawing a conclusion improves grounding.

Where it doesn’t help (or hurts)

Simple recall. “What year did WWII end?” doesn’t get better when you ask the model to think first.
Classification with strong priors. Sentiment, topic, intent — these are pattern-match tasks the model nails in one token.
Calibration-sensitive judgments. A reasoning chain can talk the model into a confident wrong answer; pointwise classifiers without CoT are often better-calibrated.

Common variants

Zero-shot CoT. Just the trigger phrase. Cheapest unlock.
Few-shot CoT. Include 2-3 worked-example demonstrations with their reasoning chains in the prompt. Stronger but more tokens.
Self-consistency. Sample N independent CoT chains and majority-vote the final answer.
Tree-of-thought. Explicit search over branching reasoning paths with backtracking.
ReAct. CoT interleaved with tool calls: the foundational agent loop.
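Of the variants above, self-consistency is the easiest to sketch in code. The example below samples several chains, extracts each final answer, and takes a majority vote. The `sample_chain` callable stands in for a real model call, and `canned_sampler`, the "last number in the chain" extraction rule, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional


def extract_final_answer(chain: str) -> Optional[str]:
    """Treat the last number in a chain as its final answer (a simplifying assumption)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistency(
    prompt: str, sample_chain: Callable[[str], str], n: int = 5
) -> Optional[str]:
    """Sample n chains and return the most common final answer, or None if none parse."""
    answers = [extract_final_answer(sample_chain(prompt)) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


def canned_sampler(chains: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: replays fixed chains instead of sampling."""
    it = iter(chains)
    return lambda prompt: next(it)


if __name__ == "__main__":
    chains = [
        "17 × 20 = 340, 17 × 3 = 51, so 340 + 51 = 391",
        "17 × 23 = 17 × 25 - 17 × 2 = 425 - 34 = 391",
        "20 × 23 = 460, minus 3 × 23 = 59, so 401",  # deliberately flawed chain
    ]
    print(self_consistency("What is 17 × 23?", canned_sampler(chains), n=3))
```

The flawed third chain is outvoted by the two that agree. In practice the sampler would call a model with a nonzero temperature so the chains actually differ.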

The production frame

CoT is a test-time-compute lever: more tokens, more accuracy, more cost and latency. For consumer chat that latency is acceptable; for high-volume backend tasks (retrieval ranking, classification, query rewriting) the cost-per-call usually rules CoT out — which is why specialized fine-tuned models that don’t need to “think out loud” dominate the production hot path.
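The trade-off can be made concrete with a back-of-the-envelope estimate. The sketch below models only output-token decoding and ignores prompt processing, batching, and network overhead. Every number in it (token counts, milliseconds per token, price) is a placeholder assumption rather than a measurement.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CallProfile:
    output_tokens: int        # tokens decoded per call (assumed)
    ms_per_token: float       # decode latency per output token (assumed)
    usd_per_1k_tokens: float  # output-token price (assumed)


def per_call(profile: CallProfile) -> Tuple[float, float]:
    """Return (latency in ms, cost in USD) for one call, counting decoding only."""
    latency_ms = profile.output_tokens * profile.ms_per_token
    cost_usd = profile.output_tokens / 1000 * profile.usd_per_1k_tokens
    return latency_ms, cost_usd


if __name__ == "__main__":
    direct = CallProfile(output_tokens=5, ms_per_token=20.0, usd_per_1k_tokens=0.01)
    cot = CallProfile(output_tokens=300, ms_per_token=20.0, usd_per_1k_tokens=0.01)
    for name, profile in (("direct", direct), ("cot", cot)):
        latency_ms, cost_usd = per_call(profile)
        print(f"{name}: {latency_ms:.0f} ms, ${cost_usd:.5f} per call")
```

Under this simple model both latency and cost scale linearly with the number of decoded tokens, so a chain 60 times longer than the direct answer costs roughly 60 times as much per call.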

CoT isn’t reasoning. It’s compute amortization plus error correction, dressed up as introspection.

Writing the reasoning out helps through two effects, and they compound. First, the model’s per-token compute is fixed. By producing more tokens, the model gets more total forward passes, which means more raw compute applied to the problem. This may be why CoT sometimes helps even when the chain itself carries little meaning: some studies report that filler tokens between question and answer can recover part of the gain on certain tasks, though the effect is not consistent across models and benchmarks.

Second, each intermediate token is conditioned on the previous tokens. “What’s 17 × 23?” is a hard one-shot prediction. “17 × 23 = 17 × 20 + 17 × 3 = 340 + 51 = 391” decomposes it into a sequence of next-token predictions the model has high confidence on at each step.

The corollary is that CoT helps where the problem decomposes naturally and hurts where it doesn’t. Multi-step arithmetic decomposes; a fact lookup doesn’t. The harder the one-shot prediction relative to the average step in a chain, the more CoT pays off.

Reasoning models are approximately CoT baked in, with three important wrinkles. First, the chains in trained reasoning models are typically much longer than human-written CoT prompts elicit: thousands of tokens of internal deliberation per response. The model learned during RL post-training that long chains earn higher reward on its training distribution, and it generates accordingly.

Second, the chains often look different from human-style reasoning. They include backtracking (“wait, that doesn’t work, let me try…”), self-checking, exploration of multiple approaches, and dead-end recovery. This behavior is a product of training rather than an accident: the RL reward function favored chains that landed on correct answers, and the model learned that “exploring then committing” beats “committing then defending”.

The third wrinkle is opacity. Many reasoning models hide the chain from the user; you see the final answer plus maybe a summary of the reasoning. This is partly a product decision (chains are noisy and hard to read) and partly a competitive one (the chain is a window into the model’s training, which providers want to obscure). For non-reasoning models, explicit CoT prompting gets you a substantial fraction of the gain at vastly lower training cost, but with the chain visible and the depth bounded by what the prompt can elicit.

Retrieval and ranking are calibration-sensitive in a way that open-ended generation isn’t. A reranker is judged on whether its top-K ordering matches relevance, and the score has to be comparable across queries (well-calibrated) for downstream stop conditions and confidence thresholds to work.

CoT in this setting introduces three problems. First, latency: a CoT-based ranker can take 10× longer or more per (query, document) pair than a pointwise classifier. At 100 candidates per query, a 10× slowdown is the difference between 100ms and 1s, and longer chains push that toward 10s. Second, cost: the per-token decoding bill scales with chain length. Third and most insidious, calibration drift: the elaborated reasoning pushes the model toward more extreme score distributions (everything is 0.95 or 0.05), which destroys the calibrated middle that production thresholds depend on.

The general principle: when you need a calibrated number, train a model to produce it in one forward pass; don’t ask an LLM to think aloud about it.
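One way to check whether a scorer's outputs are calibrated is expected calibration error (ECE): bin the scores, then compare each bin's mean score with the fraction of items in it that are actually relevant. The sketch below is a plain-Python version with equal-width bins. The toy scores and labels are invented purely to show the calculation and say nothing about any real ranker.

```python
from typing import Sequence


def expected_calibration_error(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean score and observed relevance rate per bin.

    scores are predicted relevance probabilities in [0, 1]; labels are 0 or 1.
    """
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        bins[idx].append((score, label))

    total = len(scores)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_score = sum(s for s, _ in bucket) / len(bucket)
        frac_relevant = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_score - frac_relevant)
    return ece


if __name__ == "__main__":
    labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # toy data: 7 of 10 documents relevant
    moderate = [0.7] * 10                    # scores matching the base rate
    polarized = [0.95] * 8 + [0.05] * 2      # scores pushed to the extremes
    print(expected_calibration_error(moderate, labels))
    print(expected_calibration_error(polarized, labels))
```

On this toy data the polarized scores show a larger gap than the moderate ones. Running the same measurement on labeled (query, document) pairs with and without CoT is one way to test for the drift described above.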

Go further

Why does writing the reasoning out actually help?

Two effects. First, the model's per-token compute is fixed — by producing more tokens, it gets more total compute on the problem. Second, each intermediate token conditions the next, so 'what's 17 × 23?' becomes 'what's 17 × 20 + 17 × 3?' — easier next-token decisions. The model isn't reasoning in any deep sense; it's decomposing a hard one-shot prediction into a sequence of easier ones.


Where does CoT not help?

Simple factual recall ('what's the capital of France?'), pattern-matching tasks (sentiment, intent classification), and short-form generation. Adding CoT here just produces a chatty preamble before the same answer. Worse, on calibration-sensitive tasks CoT can make the model more overconfident — the elaborated reasoning sounds authoritative even when wrong.


Reasoning models like o3 — is that just CoT baked in?

Roughly yes, with extra training. Reasoning models are post-trained with reinforcement learning to produce long internal chains of thought before answering, with the chain often hidden from the user. The 'reasoning' is still next-token prediction; the difference is that the policy was specifically optimized on tasks where extended thinking pays off. For non-reasoning models, explicit CoT prompting gets you most of the way.

