Also known as: inference-time scaling, test-time scaling, TTC, best-of-N, inference compute
TL;DR
Test-time compute trades inference budget for accuracy by spending more tokens, samples, or search steps per query. Self-consistency, best-of-N, reasoning chains, and tree search are all instances. It's the substrate behind the o1 / R1 reasoning-model paradigm.
Test-time compute is the family of inference techniques that trade compute per query for accuracy. Spend more tokens, more samples, or more search steps on each query, and a large language model gets sharper at problems with verifiable answers. The 2025 reasoning-model paradigm is built on this trade.
The taxonomy
Test-time compute techniques
Chain-of-thought. Single sample, reasoning steps inline. 3-5× the token cost of greedy; the first big jump on math and multi-step tasks.
Self-consistency. Sample N reasoning paths, then majority-vote the final answer. Improves on single-sample CoT accuracy at 2-10× the compute. The cheapest reliable trick when a discrete answer space exists.
Best-of-N. Sample N candidates, pick the one with the highest verifier or reward-model score. Powerful when a verifier exists; simple to scale because, beyond the N samples themselves, the only new cost is the scorer.
Tree-of-thought and MCTS. Search a tree of partial reasoning steps with a learned value function. Highest cost — 100-500× a single sample — but biggest gains on hard combinatorial problems.
Reflection and critique loops. Generate, critique, revise. Catches errors a single pass misses, especially in code and structured output.
Reasoning-model “thinking.” RL-trained inline chains, often 10K+ hidden tokens per query. TTC baked into the model rather than orchestrated externally.
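The two sampling-based entries in the list above differ only in how they collapse N samples into one answer. A minimal sketch, assuming the samples have already been generated and reduced to final-answer strings; the function names and the toy scorer are illustrative, not taken from any library:

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(final_answers: Sequence[str]) -> str:
    """Majority vote over the final answers of N sampled reasoning paths."""
    return Counter(final_answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier or reward-model score."""
    return max(candidates, key=score)


# Five made-up final answers sampled for the same question.
answers = ["42", "41", "42", "42", "17"]

print(self_consistency(answers))  # 42

# Toy scorer standing in for a verifier: it prefers answers closest to 40.
print(best_of_n(answers, score=lambda a: -abs(int(a) - 40)))  # 41
```

The two rules can disagree on the same samples, which is why the quality of the scorer matters as much as the number of samples.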
Why it works (and when it doesn’t)
Test-time compute helps when the task has a verifiable signal and the model has at least some probability mass on the right answer. It fails when the task is recall-bound, or when the verifier is noisy enough that best-of-N converges to verifier hacks instead of correct answers.
Verifiability — math correctness, executable tests for code, constraint satisfaction for formal logic — is what lets sampling exploration pay off; the verifier picks the right one out of N tries. Probability mass is what makes the search non-empty: if the base model never produces the right answer in any of N samples, more samples buy nothing.
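The probability-mass condition can be made concrete with a back-of-the-envelope calculation. The sketch below assumes each sample is independently correct with probability p and that a perfect verifier recognizes a correct sample when one exists; both are simplifying assumptions, and `coverage` is an illustrative name:

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct,
    when each sample is correct with probability p."""
    return 1.0 - (1.0 - p) ** n


for p in (0.0, 0.01, 0.2):
    row = [round(coverage(p, n), 3) for n in (1, 8, 64)]
    print(f"p={p}: coverage at N=1, 8, 64 -> {row}")
```

With p = 0 the coverage stays at zero for every N, which is the "more samples buy nothing" case; with even a small nonzero p, coverage climbs steadily as N grows.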
The compute-quality trade
Snell et al. (2024) made the trade explicit: in a FLOPs-matched evaluation on math reasoning, a smaller model given extra test-time compute outperformed a 14× larger model on problems where the smaller model already had a non-trivial success rate, and a compute-optimal allocation of that budget was more than 4× as efficient as a best-of-N baseline. That elasticity is what makes the o1-style paradigm work: invest pretraining compute up to a point, then scale inference compute instead. The result reframes scaling laws as a two-axis surface across pretraining and inference budgets, not a one-axis pretraining curve.
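To see what a compute-matched comparison means in practice, here is a rough sketch of the arithmetic. It assumes the common approximation of about 2 FLOPs per parameter per generated token, ignores attention and pretraining costs, and uses hypothetical model sizes; none of the numbers come from the paper:

```python
def inference_flops(params: float, tokens_per_sample: int, samples: int = 1) -> float:
    """Rough forward-pass cost: about 2 FLOPs per parameter per generated token."""
    return 2.0 * params * tokens_per_sample * samples


def matched_samples(small_params: float, large_params: float, tokens_per_sample: int) -> int:
    """How many small-model samples fit in the budget of one greedy
    large-model answer of the same length."""
    budget = inference_flops(large_params, tokens_per_sample)
    per_sample = inference_flops(small_params, tokens_per_sample)
    return int(budget // per_sample)


# Hypothetical sizes: a 2B-parameter model versus an 8B-parameter model.
print(matched_samples(small_params=2e9, large_params=8e9, tokens_per_sample=1000))  # 4
```

Under these assumptions the question becomes whether the small model with that many samples and a good aggregator beats the large model's single greedy answer.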
The connection to RL training
Modern reasoning models (o1, R1, Gemini 2.5 Thinking, GPT-5) bake TTC into the policy via RL: the model is rewarded on verifier-checkable problems, either on the final outcome (the published R1 recipe uses rule-based correctness rewards) or step by step through a process reward model, and it learns to spend test-time compute productively rather than ramble. The signal is verifier-grounded, building on the same RLHF machinery but with automatic correctness rewards replacing human preferences.
The distinction between external TTC (best-of-N applied at inference) and internal TTC (RL-trained inline thinking) is collapsing. A reasoning model with extended-thinking budget is doing best-of-N inside its own activations — exploring trajectories, scoring with a learned value function, committing when confident. The orchestration layer that used to sit outside the model now lives in its weights.
High-temperature sampling diversifies output but doesn’t help unless an aggregator — verifier, majority vote, reward model — can pick the right candidate. Without one, you’ve replaced one mediocre answer with N mediocre answers and a coin flip.
The aggregator is load-bearing; temperature is the cheap diversification knob alongside it. Self-consistency works because the aggregator is majority vote on a discrete answer space, which is free; best-of-N works because the aggregator is a trained reward model or executable verifier. Strip the aggregator out and the technique degenerates into “sample N, pick one at random.” Reasoning-model RL is the same story internally — the policy learned an aggregator and can commit to a trajectory mid-chain without an external scorer.
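A small simulation makes the aggregator's role visible. This is a toy model under stated assumptions: each candidate is independently correct with probability `p_correct`, and the verifier is modeled as a noisy binary classifier that labels each candidate correctly with probability `verifier_accuracy`. Real verifier failures (for example reward hacking) are adversarial rather than random, so this understates how bad a weak verifier can be.

```python
import random


def simulate(n: int, p_correct: float, verifier_accuracy: float,
             trials: int = 2000, seed: int = 0) -> dict:
    """Compare 'pick one of N at random' with 'pick among verifier-accepted candidates'."""
    rng = random.Random(seed)
    hits = {"random_pick": 0, "verifier_pick": 0}
    for _ in range(trials):
        # True means the candidate is actually correct.
        candidates = [rng.random() < p_correct for _ in range(n)]

        # No aggregator: choose any candidate uniformly at random.
        hits["random_pick"] += rng.choice(candidates)

        # Noisy verifier: its verdict matches the truth with probability verifier_accuracy.
        verdicts = [c if rng.random() < verifier_accuracy else not c for c in candidates]
        accepted = [c for c, v in zip(candidates, verdicts) if v]
        pool = accepted or candidates  # fall back to all candidates if none is accepted
        hits["verifier_pick"] += rng.choice(pool)
    return {name: count / trials for name, count in hits.items()}


print("perfect verifier:", simulate(n=8, p_correct=0.3, verifier_accuracy=1.0))
print("noisy verifier:  ", simulate(n=8, p_correct=0.3, verifier_accuracy=0.6))
```

With a perfect verifier, accuracy approaches the coverage of the sample set; as verifier accuracy drops toward chance, the verifier-guided pick drifts back toward the random-pick baseline.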
o1-style reasoning models bill the hidden thinking tokens: a typical hard query can consume 5K-50K tokens before any visible output, and the API charges for every one. The same prompt can cost as much as $1.00 or only a small fraction of that, depending on how long the model decides to think; per-query cost is variable in a way GPT-4-era API economics didn’t have to model.
The operational implication is budget caps on max_reasoning_tokens, latency SLOs that account for thinking time, and cost monitoring on the distribution, not the mean. The cost-side complement is speculative decoding, which reduces wall-clock and dollar cost of generating those tokens: TTC scales the quantity of compute, speculative decoding scales the efficiency. Production stacks need both.
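Monitoring the distribution rather than the mean can be sketched in a few lines. Everything here is illustrative: the price is a hypothetical placeholder rather than any vendor's rate, the token counts are made up, and `max_reasoning_tokens` is treated as a generic cap, not a specific API parameter.

```python
import math

PRICE_PER_1K_TOKENS = 0.06  # hypothetical USD price, not a real vendor rate


def query_cost(reasoning_tokens: int, visible_tokens: int) -> float:
    """Cost of one query when hidden reasoning tokens are billed like visible ones."""
    return (reasoning_tokens + visible_tokens) / 1000 * PRICE_PER_1K_TOKENS


def percentile(values, q: int):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def capped(reasoning_tokens: int, max_reasoning_tokens: int) -> int:
    """Apply a per-query budget cap on reasoning tokens."""
    return min(reasoning_tokens, max_reasoning_tokens)


# Made-up reasoning-token counts for ten queries; two of them think for a long time.
observed = [800, 1200, 900, 1500, 1100, 700, 48000, 1300, 1000, 22000]
costs = [query_cost(t, visible_tokens=300) for t in observed]

print("mean:", round(sum(costs) / len(costs), 3))
print("p50: ", round(percentile(costs, 50), 3))
print("p90: ", round(percentile(costs, 90), 3))

# The same traffic with a 16K cap on reasoning tokens.
capped_costs = [query_cost(capped(t, 16000), visible_tokens=300) for t in observed]
print("p90 with cap:", round(percentile(capped_costs, 90), 3))
```

In this made-up sample the mean sits far above the median because two long-thinking queries dominate spend, which is the pattern a mean-only dashboard hides.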
Go further
Is test-time compute a substitute for pretraining compute?
Partly. Snell et al. (2024) showed that on math reasoning tasks, in a FLOPs-matched comparison, a smaller model with extra test-time compute can outperform a 14× larger model on problems where the smaller model already has a non-trivial success rate. The trade is real but task-dependent: reasoning tasks scale well with TTC; pure recall does not.
How do best-of-N and self-consistency differ?
Best-of-N samples N independent answers and picks the one with the highest reward-model or verifier score. Self-consistency samples N reasoning paths and majority-votes the final answer. Both use the same compute primitive (parallel sampling) but differ on the aggregation rule.
How do reasoning models use test-time compute?
Post-o1 reasoning models are trained with RL signals on verifier-checkable problems (math, code), reinforcing long internal chain-of-thought sequences. Generation expands until the model commits to an answer, typically after thousands to tens of thousands of hidden "thinking" tokens and before any visible output.