Temperature Sampling

Also known as: softmax temperature, sampling temperature, T parameter. Related: greedy decoding (the T = 0 special case)

TL;DR

Temperature is a scalar that divides the logits before softmax, controlling how peaked or flat the next-token distribution is. Temperature 0 is greedy decoding (always pick the argmax); higher temperatures sample more diversely.

Temperature is a single scalar applied to the logits before softmax during sampling: $p_i = \mathrm{softmax}(z / T)_i$. As $T \to 0$, the distribution collapses to a one-hot at the argmax (greedy decoding). As $T \to \infty$, it flattens toward uniform. Everything in between trades off determinism against diversity. It’s the single most-tweaked sampling knob, and it sits at the heart of every other sampling-related decision.

The mechanics

Given the model’s raw logits $z = (z_1, \dots, z_V)$ over the vocabulary, the sampled distribution at temperature $T$ is:

$$p_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)}$$

Temperature 1.0 is the baseline — the model’s “natural” distribution. Below 1.0, gaps between logits are amplified, so high-probability tokens get more probability mass and low-probability tokens get less. Above 1.0, gaps are compressed, so the distribution flattens.

Temperature 0 is a special case: deterministic argmax. The formula divides by T, so it is undefined at exactly 0. Implementations either special-case T=0 or set T to a tiny positive number; barring exact ties between the top logits, the result is the same.
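The mechanics above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the function names `temperature_softmax` and `sample_token` and the example logits are made up for this sketch, and T=0 is handled by the special-case route rather than the tiny-positive-number route.

```python
import math
import random

def temperature_softmax(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Special case: one-hot on the argmax (first index wins on ties).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Illustrative logits for a three-token vocabulary (made-up values).
logits = [2.0, 1.0, 0.0]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in temperature_softmax(logits, t)])
```

Running the loop shows the top token's probability shrinking as T grows, while the ranking of tokens never changes.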

Across providers, the math is the same, but the effective temperature scale isn’t always identical, because different providers post-process logits differently before applying T. Some apply repetition penalties, some re-normalize after top-p filtering, and some use a different default T. T = 0.7 on OpenAI is similar but not identical to T = 0.7 on Anthropic or vLLM. For reproducibility, the only safe approach is to fix T explicitly and test empirically on your task; don’t assume cross-provider equivalence.

Why temperature 0 isn’t always best

The intuitive argument for T=0: “I want the model’s best answer, so pick the most likely token at every step.” But that’s a local optimum that can lead into globally bad paths.

Three concrete reasons higher temperatures sometimes win:

1. Sequence-level likelihood. Greedy decoding maximizes $p(x_t \mid x_{<t})$ at each step but doesn’t maximize $p(x_1, \dots, x_n) = \prod_t p(x_t \mid x_{<t})$ overall. A token with the locally-second-highest probability might lead into a sentence that’s globally more likely than the greedy continuation. This is why beam search (which considers multiple paths) often beats greedy on sequence-level metrics like BLEU.

2. Mode collapse. At T=0 the model produces the exact same output every time. Useless for brainstorming, ensemble methods, or any case where you want diverse candidates. Self-consistency explicitly samples multiple chains-of-thought at T~0.7 and votes; T=0 makes the technique meaningless.

3. Calibration. Modern LLMs trained with cross-entropy loss tend to be over-confident — the gap between the top logit and the runner-up is larger than the empirical evidence justifies. Setting T slightly above 1.0 (or applying temperature-scaling post hoc) often produces better-calibrated probabilities on out-of-distribution inputs. This matters when you’re using model probabilities downstream (uncertainty quantification, abstention, RAG decision-making).
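The first reason is easy to see on a toy example. The sketch below uses a hypothetical two-step "model" written as a lookup table (the tokens and probabilities are invented purely for illustration) and compares the greedy path with the most likely full sequence found by brute force.

```python
# Hypothetical two-step "language model": next-token probabilities
# conditioned on the prefix so far. All numbers are made up.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy(model, steps):
    """Pick the locally most likely token at every step."""
    prefix, prob = (), 1.0
    for _ in range(steps):
        dist = model[prefix]
        token = max(dist, key=dist.get)
        prob *= dist[token]
        prefix += (token,)
    return prefix, prob

def best_sequence(model, steps):
    """Exhaustively score every sequence (feasible only for toy models)."""
    candidates = [((), 1.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), prob * p)
            for prefix, prob in candidates
            for tok, p in model[prefix].items()
        ]
    return max(candidates, key=lambda c: c[1])

greedy_path, greedy_prob = greedy(MODEL, 2)
best_path, best_prob = best_sequence(MODEL, 2)
print("greedy:", greedy_path, round(greedy_prob, 3))
print("best:  ", best_path, round(best_prob, 3))
```

Greedy commits to "A" because it wins the first step, but the sequence starting with "B" ends up more likely overall. Sampling at T > 0 or running beam search gives the decoder a chance to find it.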

When temperature 0 IS best

For tasks where any deviation from the most-likely answer is wrong — code generation against a spec, factual extraction, structured-output schemas, constrained-decoding over a strict grammar — T=0 is the right default. The variability bought by higher T is pure downside.

The rule of thumb: if there’s a single correct answer and you’d be unhappy with a different answer, use T=0. If the task tolerates variation (creative writing, conversational replies, generating multiple candidates for ensembling), use T > 0.

Calibration as a downstream concern

If your application uses the model’s confidence (a router that asks the model “are you sure?”, an abstention layer that refuses low-confidence answers, an LLM-as-judge that scores), temperature affects what those probabilities mean. A model run at T=0 produces a probability of 1.0 on its argmax — but that doesn’t mean the answer is actually certain, just that the math is degenerate.

For confidence-aware applications, sample at T=1.0 (or even higher) and inspect the full distribution before applying any temperature change. The probability of the chosen token under the unmodified distribution is a more honest confidence signal than anything you can extract after a temperature transformation.
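As a rough sketch of that idea, the snippet below picks the argmax token (what T=0 would choose) but reports its probability under the unmodified T=1 distribution. The helper name `greedy_with_confidence` and the logits are hypothetical, and the sketch assumes you have access to raw logits, which not every API exposes.

```python
import math

def softmax(logits):
    """Plain softmax, i.e. the T = 1 distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_with_confidence(logits):
    """Pick the argmax token, but report its probability at T = 1."""
    probs = softmax(logits)  # unmodified distribution
    token = max(range(len(logits)), key=lambda i: logits[i])
    return token, probs[token]

# Made-up logits where the top two tokens are nearly tied.
token, confidence = greedy_with_confidence([2.0, 1.8, 0.1])
print(token, round(confidence, 3))
```

Here the T=0 view would report probability 1.0 for token 0, while the T=1 view shows it is barely ahead of the runner-up, which is the signal an abstention layer actually needs.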

Production defaults

For most production systems: T = 0 for any LLM call whose output is consumed by code, T = 0.7 for any output consumed by a human conversational interface, T = 1.0 when generating diverse candidates for downstream voting or reranking. Don’t reach for exotic values without an eval that justifies them. The interaction with top-p sampling is more important than the exact T value; tune them together against task quality.
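One way to keep these defaults in a single place is a small lookup with an explicit override. The category names and the `pick_temperature` helper below are hypothetical, chosen only for this sketch; the values are the defaults stated above.

```python
# Hypothetical task categories; values mirror the defaults stated above.
DEFAULT_TEMPERATURE = {
    "machine_consumed": 0.0,      # output parsed or executed by code
    "conversational": 0.7,        # output read by a human in a chat UI
    "candidate_generation": 1.0,  # diverse samples for voting or reranking
}

def pick_temperature(task_kind, override=None):
    """Return the default T for a task kind unless an eval-backed override is given."""
    if override is not None:
        return override
    return DEFAULT_TEMPERATURE[task_kind]

print(pick_temperature("conversational"))
print(pick_temperature("conversational", override=0.4))
```

Requiring an explicit `override` argument makes non-default values visible at the call site, which fits the advice to justify them with an eval.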

Go further

Why isn't temperature 0 always optimal?

Greedy decoding (T=0) maximizes the likelihood of each individual token but doesn't maximize the likelihood of the whole sequence — the locally-best token can lead into a low-probability path. It also kills any diversity, which matters for creative tasks, brainstorming, and self-consistency-style ensembling. For most factual tasks T=0 is fine; for generative ones, higher T usually wins.


How does temperature interact with calibration?

Temperature reshapes the distribution but preserves the argmax, so top-1 accuracy is unchanged while calibration metrics (expected calibration error, etc.) measured after softmax depend on T. Models trained with cross-entropy loss tend to be over-confident at T=1; turning T up slightly often improves calibration on out-of-distribution data. This is the temperature-scaling technique from Guo et al. 2017, where T is fit on held-out validation data after training and then applied at inference.
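A minimal sketch of fitting T on held-out data follows. It assumes a simple grid search over T that minimizes mean negative log-likelihood, which is a simplification chosen for readability (a gradient-based optimizer is the more usual choice), and the validation logits and labels are invented for illustration.

```python
import math

def log_softmax(logits, temperature):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]

def mean_nll(batch_logits, labels, temperature):
    """Average negative log-likelihood of the true labels at this T."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        total -= log_softmax(logits, temperature)[label]
    return total / len(labels)

def fit_temperature(batch_logits, labels, grid=None):
    """Grid-search the T that minimizes validation NLL (a simplification)."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]  # 0.5 .. 5.0
    return min(grid, key=lambda t: mean_nll(batch_logits, labels, t))

# Made-up validation set: every prediction is confident (logit gap of 3),
# but one of the four labels disagrees with the argmax.
val_logits = [[3.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 0.0]]
val_labels = [0, 0, 1, 1]

fitted_t = fit_temperature(val_logits, val_labels)
print(round(fitted_t, 2))
```

Because the toy model is right only three times out of four while acting far more certain, the fitted T comes out above 1, softening the probabilities without changing any prediction.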


What temperature do production systems use?

It depends on the task. Code generation, factual QA, structured extraction: T = 0 to 0.3. Conversational assistants: T = 0.7 to 1.0. Creative writing, brainstorming: T = 1.0 to 1.4. Self-consistency ensembling: T around 0.7 to generate diverse samples that are then voted on. Above T=1.5 the model becomes incoherent quickly.

