Reasoning Model

Also known as: thinking model, reasoning-tuned LLM, test-time compute

TL;DR

A reasoning model is an LLM trained to spend test-time compute on internal chain-of-thought before answering. The post-o1 paradigm: pretraining + SFT + RL on verifier-checkable problems, with hidden 'thinking' tokens as the substrate.

A reasoning model is an LLM trained to spend test-time compute on internal reasoning before producing its final answer. The pattern was popularized by OpenAI’s o1 (September 2024), and within a year the same shape appeared across the field: o3, DeepSeek-R1, Gemini 2.0 Thinking, Claude’s extended-thinking mode. By 2025, reasoning models are the frontier; non-reasoning models are the cost-tier.

The three ingredients

What makes a reasoning model

Chain-of-thought as the substrate. The model’s internal reasoning is just more next-token prediction — a long, often hidden, scratchpad of intermediate work before the final answer. Mechanically: autoregressive generation, same as any LLM.
RL on verifier-checkable problems. Math, competitive programming, formal proofs, structured logic. The reward is outcome correctness against an automatic verifier (a unit-test runner, a math evaluator, a theorem prover) — not against a noisy human-preference reward model.
The unlock: verification is easier than generation. For these domains, checking whether an answer is correct is dramatically cheaper than producing the answer. That asymmetry means you can manufacture vast quantities of RL signal almost for free — generate-and-check is essentially unlimited training data.

Verification-easier-than-generation is the regime that makes long-chain reasoning trainable. The same recipe doesn’t transfer to essay writing or design critique because the verifier doesn’t exist; that’s why current reasoning models excel at math and code, and not at open-ended creative work.

The “thinking” architecture

Reasoning models internalize chain-of-thought as default behavior. The user prompts a question; the model emits thousands of tokens of internal deliberation (often inside <thinking> tags or an out-of-band stream); finally it emits the answer. The chain is typically hidden — you see the answer plus, sometimes, a summary.

The key training detail: the model was RL’d to use the scratchpad even when not asked. CoT prompting on a non-reasoning model relies on the user remembering to add “let’s think step by step”; a reasoning model produces the chain unconditionally, because the policy learned during RL that long chains earn higher reward on its training distribution.

Why this is a paradigm shift

The 2022-2024 recipe:

pretraining SFT RLHF (or DPO )

The 2025+ recipe:

pretraining SFT RL on verifiable rewards (often with PRMs or outcome rewards)

The first stage stayed the same. The second stage stayed the same. The alignment stage changed substrates — from “what humans prefer” to “what a verifier accepts.” That swap is the single biggest post-training change since RLHF itself.

What reasoning models are good at — and not

The pattern is sharp, and it follows directly from the verifier-availability story:

Strong: competition math, competitive programming, multi-step physics and chemistry, formal logic, anything where there’s a checkable answer. o3 and R1 hit superhuman levels on AIME, USAMO, and Codeforces benchmarks.
Mixed: open-ended technical reasoning (system design, debugging) where the verifier is fuzzy. Reasoning models help but the gap to non-reasoning models is smaller.
Weak / no improvement: creative writing, calibration-sensitive judgment, retrieval ranking. CoT often hurts here for the same reason it hurts in non-reasoning models — the model talks itself into confident-wrong answers, and there was no verifier during training to discipline that.

The RL-on-verifiable-rewards loop

The actual training pattern, in production form:

Sample a problem with a known-correct answer (or a verifier).
Have the policy produce a chain-of-thought + answer.
Verify. Score the chain — outcome reward, or per-step PRM reward, or a mix.
Update the policy with PPO, GRPO, or one of the DPO-family losses adapted to outcome-rewarded data.
Repeat for tens of millions of episodes.

The compute is dominated by the inference (generating chains) and the verification (running them against the checker). The gradient updates are cheap by comparison. This is why reasoning-model training looks more like a high-throughput inference pipeline with an RL update tail than like classical model training.

Two compounding effects. First, longer chains let the model decompose harder problems into more, smaller next-token decisions — the same compute-amortization story as standard CoT. Second, training on verifier-graded chains taught the policy which extension patterns earn reward — backtracking when stuck, self-checking before committing, exploring multiple approaches. So during inference, more tokens means more productive exploration, not just more tokens.

The empirical scaling curve is striking: Lightman et al. and the o1 system card both show roughly log-linear improvement in pass-rate as a function of inference token budget, on math benchmarks, over two or three orders of magnitude of test-time compute. There’s no equivalent curve for non-reasoning models — they saturate quickly because they weren’t trained to use the extra tokens.

The catch: this curve only holds where the verifier exists. Outside that regime, the model is generating tokens without the trained-in steering, and the curve flattens fast.

The full causal chain: RL on verifier-checkable problems trains the model on a high-density, low-noise reward signal. That signal is what teaches the policy to adopt productive long-chain patterns (backtracking, self-checking, multi-approach exploration). Take away the verifier — train on a fuzzy LLM-judge or human-preference reward instead — and the signal becomes noisier and less dense, and the trained-in long-chain habits become less reliable.

This is the central limitation of the reasoning-model paradigm. The places it works (math, code, formal logic) are exactly the places where automatic verification is tractable. Domains where verification is hard (essay writing, design, strategy, anything multi-modal where humans disagree about quality) don’t have a clean path to the same training recipe. So we get reasoning models that are superhuman at AIME and merely decent at writing a coherent product spec — and there’s currently no obvious fix beyond “build better verifiers”, which is an open research problem.

The optimistic version: as LLM judges get more reliable (zELO-style calibrated judges, multi-judge ensembles, structured rubrics), the verifier frontier expands, and the reasoning recipe expands with it. The pessimistic version: outside math and code, the noise floor of judgment dominates, and you can’t RL your way past a noisy reward.

The market shape

By mid-2025, every frontier lab ships a reasoning model: OpenAI’s o3 family, Anthropic’s Claude with extended thinking, Google’s Gemini Thinking, DeepSeek’s R1, Mistral’s Magistral. Pricing reflects the new dimension — reasoning models cost 5-50× per query versus non-reasoning models, and customers pay it because the quality gap on hard problems is decisive. The non-reasoning tier hasn’t gone away; it’s the latency and cost-sensitive substrate for chat, coding-assist completion, and high-volume backend tasks. The reasoning tier is for hard problems where you can afford to wait.

The strategic picture: pretraining-compute scaling is hitting diminishing returns, test-time-compute scaling is open-ended, and verifier-based RL is the lever that converts inference budget into capability. That’s the recipe for the next several years.

Go further

What changed between RLHF-era and reasoning-era post-training?

RLHF rewarded the policy for outputs humans preferred — a noisy, subjective signal that capped quality at human-judgment quality. Reasoning-era RL rewards the policy for outputs that pass automatic verifiers — a clean, dense, scalable signal that can exceed any individual human's evaluation accuracy. The shift from human preference to verifier correctness is the substrate that makes long-chain reasoning trainable.

RLHF Process reward model

Why is the chain hidden from the user?

Two reasons. Product: chains are noisy, repetitive, and often look unhinged ('wait, no, let me try again') — they make great training signal but bad UX. Competitive: the chain is the most direct window into how the model was trained, including which RL rewards it learned to chase, which lets competitors reverse-engineer your post-training recipe. The user sees a final answer plus, sometimes, a sanitized summary.

Chain-of-thought Self-consistency

Is reasoning-mode just CoT prompting under the hood?

Mechanically similar — autoregressive token generation conditioned on a long internal context. The difference is training: a reasoning model has been RL'd on tens of millions of verifier-graded chains, so it learned which patterns of chain-of-thought actually produce correct answers on its training distribution. A non-reasoning model with a CoT prompt is doing the right thing structurally but without the reinforced trajectory bias. That training is the reason o1 dramatically outperforms GPT-4 + 'let's think step by step' on hard math even though the substrate looks the same.

Tree-of-thought DPO

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs