Self-Consistency

Also known as: self-consistency decoding, majority-vote reasoning, CoT-SC

TL;DR

Self-consistency samples N independent chain-of-thought reasoning paths and majority-votes the final answer. After plain chain-of-thought, it's the cheapest test-time-compute trick.

Self-consistency is a test-time decoding strategy that pairs naturally with chain-of-thought: instead of generating one reasoning chain at temperature 0 and trusting it, sample N chains at non-zero temperature and pick the answer that appears most often. The intuition, made explicit by Wang et al. (2022): correct reasoning paths tend to converge on the same final answer, while incorrect paths spray across many wrong answers. Voting filters the noise.

Self-consistency is classical ensembling with the same model standing in for multiple base learners. The temperature is what decorrelates them.

The procedure

1. Construct a CoT prompt for the task.
2. Sample N completions independently with temperature > 0 (typically 0.5-0.9).
3. Extract the final answer from each completion (e.g., the integer after “Answer:” for math problems, the JSON object for structured tasks).
4. Take the mode of the extracted answers as the final output.

That’s it. No additional model, no extra training, no special inference machinery — just N independent calls plus a tally.
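The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `sample` is a hypothetical callable standing in for whatever LLM client you use, and the `Answer:` regex assumes integer answers as in the math example.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Assumption: final answers are integers that follow an "Answer:" marker.
ANSWER_RE = re.compile(r"Answer:\s*(-?\d+)")


def extract_answer(completion: str) -> Optional[str]:
    """Return the integer after the last 'Answer:' marker, or None if absent."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


def self_consistency(
    sample: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> completion
    prompt: str,
    n: int = 10,
    temperature: float = 0.7,
) -> Optional[str]:
    """Sample n chains, extract each final answer, and return the most common one."""
    answers = [extract_answer(sample(prompt, temperature)) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)  # drop unparseable chains
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Chains with no parseable answer are dropped rather than counted as a vote, which is one reasonable choice among several; ties go to whichever answer was seen first.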

Why it works

The key statistical claim is that errors across CoT samples are uncorrelated enough that voting recovers the right answer. Each sample, with non-zero temperature, follows a different trajectory through the reasoning space; bad trajectories produce different bad answers; good trajectories tend to land on the same correct one. The right answer wins by plurality.

This breaks down whenever the model has a systematic bias — if the LLM consistently misreads a particular kind of word problem, all N samples may agree on the wrong interpretation. Voting reduces variance, not bias. A miscalibrated prior shows through every sample equally.
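The "variance, not bias" point can be checked with a small calculation. The sketch below makes idealized assumptions that are not in the original text: a binary answer space, fully independent samples, and an odd N so there are no ties. The function name is illustrative. Real tasks with many possible wrong answers are more forgiving, because wrong votes are split and the right answer only needs a plurality.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """P(a strict majority of n independent samples is correct).

    p is the per-sample accuracy. Assumes two possible answers and odd n.
    """
    if n % 2 == 0:
        raise ValueError("use odd n to avoid ties")
    need = n // 2 + 1  # votes required for a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Per-sample accuracy above 0.5: voting helps. Below 0.5: voting hurts.
for p in (0.6, 0.4):
    print(p, [round(majority_vote_accuracy(p, n), 3) for n in (1, 5, 11, 21)])
```

Under these assumptions, voting amplifies whichever answer the model already leans toward: accuracy climbs with N when the model is right more often than not, and falls with N when it is systematically wrong.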

Cost-quality tradeoff

Self-consistency is the simplest test-time-compute lever after plain CoT:

Strategy	Inference cost	Typical accuracy gain over baseline
Greedy decode	1x	0%
Single CoT	~3-5x tokens	+10-30% on math
Self-consistency, N=10	~30-50x tokens	+5-15% over single CoT
Tree-of-thought	~100-500x	Task-dependent further gains

Numbers are task-dependent but the shape is consistent. Self-consistency offers the best accuracy gain per dollar in this ladder for reasoning tasks. Past N≈20 the marginal accuracy gain per extra sample becomes small.

Where it fits in production

Self-consistency lives in the same niche as CoT: high-value, low-volume reasoning tasks where seconds of latency are acceptable. It’s standard in research benchmarks (GSM8K, MATH, BBH) and reasonable in user-facing chat where the user expects to wait. It’s almost never the right choice for high-volume backend tasks (retrieval ranking, classification, query rewriting) where per-call cost dominates — there, fine-tuning a small specialized model wins on every dimension.

When self-consistency is worth the cost

Math word problems — GSM8K, MATH; voting recovers 5-15 absolute accuracy points.
Multi-hop reasoning over docs — multiple chains land on the same supported answer; bad chains land on different unsupported ones.
Code generation with test cases — sample N completions, pick the one that passes the most tests (a voting variant).
Faithfulness checks — ask N times “is this claim supported by this source?”; majority vote denoises individual judgments.
High-stakes structured extraction — when getting one field wrong costs more than the roughly 10× extra compute.
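The code-generation variant in the list above replaces the tally with a test count. A minimal sketch, assuming each candidate is already available as a Python callable (in practice, generated code would be run in a sandbox); `passed` and `pick_by_tests` are illustrative names:

```python
from typing import Callable, Sequence, Tuple

# Assumption: a test case is (positional args, expected return value).
TestCase = Tuple[tuple, object]


def passed(candidate: Callable, tests: Sequence[TestCase]) -> int:
    """Count the tests a candidate passes; an exception counts as a failure."""
    count = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                count += 1
        except Exception:
            pass
    return count


def pick_by_tests(candidates: Sequence[Callable], tests: Sequence[TestCase]) -> Callable:
    """Return the candidate with the most passing tests (first one wins ties)."""
    return max(candidates, key=lambda c: passed(c, tests))
```

Unlike plain voting, this selects on an external signal (the tests), so a single correct candidate can win even when most samples are wrong.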

Self-consistency, best-of-N, and tree-of-thought are all test-time-compute strategies that trade inference dollars for accuracy, but they use that compute differently.

Self-consistency samples N independent chains, extracts a discrete answer from each, and majority-votes. No external model. Cheap. Requires a discretizable answer space.
Best-of-N samples N independent chains and picks the one with the highest score from an external reward model (e.g., a verifier or judge). More expensive (requires the RM) but works on open-ended generation where voting cannot.
Tree-of-thought explores reasoning as a tree, branching at intermediate states with self-evaluation pruning. Vastly more compute, vastly more flexible, but rarely worth it outside of agentic problem solving where the search structure naturally matches the task.
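The difference between the first two strategies comes down to the selection step. A minimal sketch, where `score` is a hypothetical stand-in for an external reward model or verifier (tree-of-thought is omitted because it needs a search loop rather than a single selection):

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Self-consistency: pick the most frequent extracted answer."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(completions: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: pick the completion the external scorer rates highest."""
    return max(completions, key=score)
```

`majority_vote` needs answers that can be compared for equality, while `best_of_n` works on free-form text but is only as good as its scorer.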

Modern reasoning models (o1, DeepSeek-R1) effectively internalize a learned variant of best-of-N — they generate long chains-of-thought and self-evaluate, with the equivalent of voting baked into RL training rather than runtime sampling. For tasks where you cannot fine-tune, self-consistency at N=10 remains the cheapest reliable trick.

Sampling temperature controls the variance-bias tradeoff between samples. At T=0, all N samples are identical (no decorrelation, no benefit from voting). At T=1.0+, samples decorrelate fully but each individual sample’s quality drops — you’re voting between many low-quality reasoners.

The empirical sweet spot is T ≈ 0.5-0.7 for most reasoning tasks at N=5-20. Crucially, the right T grows with N: for N=5, T ≈ 0.4 maximizes accuracy; for N=40, T ≈ 0.8 does. The intuition is that more samples can absorb more individual-sample noise, so you can afford to push each sample further from the argmax to gain decorrelation. Tuning T on a held-out dev set is cheap and pays back consistently.
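Tuning T on a dev set amounts to a small grid search. The sketch below assumes a hypothetical `sample_answer(question, temperature)` callable that returns an already-extracted answer, and a dev set of (question, gold answer) pairs; neither is specified in the original text.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple


def sweep_temperature(
    sample_answer: Callable[[str, float], str],  # hypothetical: (question, T) -> extracted answer
    dev_set: Sequence[Tuple[str, str]],          # (question, gold answer) pairs
    temperatures: Sequence[float],
    n: int,
) -> Dict[float, float]:
    """Return majority-vote accuracy on the dev set for each temperature."""
    results = {}
    for t in temperatures:
        correct = 0
        for question, gold in dev_set:
            answers = [sample_answer(question, t) for _ in range(n)]
            voted = Counter(answers).most_common(1)[0][0]
            if voted == gold:
                correct += 1
        results[t] = correct / len(dev_set)
    return results
```

Picking the winner is then `max(results, key=results.get)`. Because the best T depends on N, the sweep should be run at the N you intend to deploy.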

Go further

How many samples is enough?

Diminishing returns kick in quickly. Most papers see the bulk of the gain by N=5-10; beyond N=20-40 the curve flattens. The right N depends on how confident the base model is: for tasks where the model is 90% accurate already, 5 samples suffice; for tasks where it's 30% accurate, more samples help but you'd be better off finding a different prompting strategy entirely.

Why does voting work? Aren't all samples drawn from the same model?

The mechanism is that incorrect reasoning chains tend to disagree with each other (many ways to be wrong), while correct chains tend to converge on the same final answer (one right answer). With non-zero temperature, samples diverge enough early that errors decorrelate; the right answer surfaces as the modal output. It's classical ensembling, with the same model standing in for multiple base learners.

Where does self-consistency break?

When the model is systematically wrong in a correlated way — e.g., all samples agree on the wrong answer because the underlying misconception is in the model's prior. Voting can't fix bias, only variance. It also breaks on open-ended generation (no canonical 'final answer' to vote on); voting only works when outputs reduce to a small set of equivalence classes.
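Reducing outputs to equivalence classes usually means normalizing answers before counting them. A minimal sketch for numeric answers, assuming they arrive as short strings; `canonical` and `vote` are illustrative names, and the normalization rules (strip commas and a trailing period) are assumptions rather than a standard:

```python
from collections import Counter
from fractions import Fraction
from typing import Optional, Sequence


def canonical(raw: str) -> Optional[str]:
    """Map equivalent numeric spellings to one key, e.g. '0.50' and '2/4' -> '1/2'."""
    text = raw.strip().replace(",", "").rstrip(".")
    try:
        return str(Fraction(text))  # exact rational, reduced to lowest terms
    except (ValueError, ZeroDivisionError):
        return None  # not a number we know how to normalize


def vote(raw_answers: Sequence[str]) -> Optional[str]:
    """Majority-vote over canonical forms, ignoring unparseable answers."""
    counts = Counter(k for k in map(canonical, raw_answers) if k is not None)
    return counts.most_common(1)[0][0] if counts else None
```

Without this step, "0.5" and "1/2" would split the vote and a less common wrong answer could win.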
