Top-p (Nucleus) Sampling

Also known as: nucleus sampling, top-p, top_p

TL;DR

Top-p sampling restricts each step's sampling to the smallest set of tokens whose cumulative probability is at least p. Unlike top-k's fixed cutoff, the nucleus adapts to the distribution's shape.

Top-p sampling, also called nucleus sampling, is a token-selection strategy that, at each step, samples only from the smallest set of tokens whose cumulative probability is at least p. Tokens outside that set are excluded. It's a more principled alternative to top-k's fixed cutoff because the size of the sampling set adapts to the distribution: narrow when the model is confident, wide when it isn't. A value around p = 0.9 is a widely used sampling setting in production LLM deployments.

The mechanics

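Given the post-temperature softmax distribution $q_1 \ge q_2 \ge \dots \ge q_{|V|}$ (sorted descending) over the vocabulary $V$:

Find the smallest $k$ such that $\sum_{i=1}^{k} q_i \ge p$.
Renormalize the top $k$ probabilities: $q'_i = q_i / \sum_{j=1}^{k} q_j$ for $i \le k$, zero otherwise.
Sample from the renormalized distribution $q'$.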
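
The three steps above can be sketched in plain Python. This is a minimal illustration rather than any library's implementation; the function names (softmax, top_p_filter, sample_top_p) and the example logits are assumptions chosen for this sketch.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the nucleus."""
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest prefix whose mass reaches p
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / cumulative for i in nucleus}


def sample_top_p(probs, p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    filtered = top_p_filter(probs, p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical logits for a five-token vocabulary.
probs = softmax([2.0, 1.0, 0.5, -1.0, -3.0], temperature=1.0)
nucleus = top_p_filter(probs, p=0.9)
print(sorted(nucleus))  # indices of the tokens kept in the nucleus
print(sample_top_p(probs, p=0.9))
```

With these logits the first three tokens hold roughly 97% of the mass, so the two lowest-ranked tokens are never sampled at p = 0.9.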

The size of the nucleus (the number of tokens kept, k) varies per step. If the top token has 99% probability, the nucleus is just that one token for any p up to 0.99, and top-p degenerates to greedy. If the top 200 tokens each have ~0.5% probability, the nucleus is wide. The cutoff falls where the cumulative mass crosses p, which is roughly where the long tail starts.
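
To make the two cases concrete, here is a small sketch that counts nucleus size on two hand-built distributions. The helper name nucleus_size and both distributions are illustrative assumptions that mirror the numbers in the paragraph above.

```python
def nucleus_size(probs, p):
    """Number of tokens in the smallest set whose mass reaches p."""
    cumulative = 0.0
    for k, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative >= p:
            return k
    return len(probs)  # float rounding left the total just below p


# Confident step: one token holds 99% of the mass, 100 others share 1%.
peaked = [0.99] + [0.0001] * 100
# Uncertain step: 200 tokens at 0.5% each.
flat = [0.005] * 200

peaked_size = nucleus_size(peaked, p=0.9)
flat_size = nucleus_size(flat, p=0.9)
print(peaked_size)  # 1
print(flat_size)    # about 180; float rounding can shift it by one
```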

The key observation is that low-probability tokens contribute disproportionately to incoherent generation but only marginally to perplexity. The bottom 99% of tokens by probability rank contribute essentially nothing meaningful to a coherent next-token guess, but at high temperature their cumulative mass can still be a few percent — enough to occasionally sample one and produce nonsense. Sorting by probability and clipping at cumulative mass surgically removes the noise tail without distorting the relative weights of the plausible candidates. The original Holtzman et al. 2019 paper showed this beat both pure sampling and beam search on text-quality metrics.

Why it generally beats top-k

Top-k sampling restricts to the k highest-probability tokens with fixed k (typically 40-100). The problem is that distributions vary wildly in shape:

Confident step. "The capital of France is ___": the top token (Paris) has ~99% probability. Top-k=40 keeps 39 tokens of near-zero probability in play, and any of them can still be sampled, especially at high temperature.
Uncertain step. “I think we should ___” — hundreds of plausible continuations, each at ~0.3%. Top-k=40 cuts off legitimate alternatives.

Top-p adapts. At the confident step, the nucleus is 1-2 tokens. At the uncertain step, the nucleus might be 200. A cutoff defined by cumulative probability mass is a more meaningful boundary than a fixed token count.
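
The contrast can be checked numerically. The sketch below applies both filters to a confident and an uncertain distribution; the helper names (top_k_keep, top_p_keep) and the two toy distributions are assumptions for illustration, not measurements from a real model.

```python
def top_k_keep(probs, k):
    """Probabilities of the k highest-ranked tokens (not renormalized)."""
    return sorted(probs, reverse=True)[:k]


def top_p_keep(probs, p):
    """Probabilities in the smallest prefix whose mass reaches p."""
    kept, cumulative = [], 0.0
    for q in sorted(probs, reverse=True):
        kept.append(q)
        cumulative += q
        if cumulative >= p:
            break
    return kept


# Confident step: one token at 99%, the other 99 tokens share 1%.
confident = [0.99] + [0.01 / 99] * 99
# Uncertain step: 300 equally plausible continuations (~0.33% each).
uncertain = [1 / 300] * 300

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    k_kept = top_k_keep(dist, k=40)
    p_kept = top_p_keep(dist, p=0.9)
    print(name, "top-k keeps", len(k_kept), "tokens, mass", round(sum(k_kept), 3))
    print(name, "top-p keeps", len(p_kept), "tokens, mass", round(sum(p_kept), 3))
```

On the confident distribution top-k keeps 40 tokens where top-p keeps one; on the uncertain distribution top-k retains only about 13% of the probability mass while top-p retains about 90%.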

In practice, top-p is usually preferred over top-k for open-ended text quality, in line with the comparisons reported by Holtzman et al. Most modern systems either use top-p alone or top-p combined with a generous top-k cap as a backstop.

Common values and their behavior

p = 1.0: no filtering. Equivalent to standard temperature sampling.
p = 0.95: typical choice for "creative" generation. Allows long-tail diversity while suppressing the worst noise.
p = 0.9: tighter. A common general-purpose value, though API defaults vary by provider, so check the documentation rather than assuming 0.9.
p = 0.7: quite tight. Approaches greedy behavior on confident-distribution steps; useful for tasks where you want near-determinism with a small amount of variability.
p = 0: degenerate. Implementations keep at least the top token, so only the argmax remains in the nucleus and the result is identical to greedy decoding.

The interaction with temperature matters: at low temperature the distribution is already peaky and top-p has little to do; at high temperature top-p is what prevents the flattened tail from polluting output. The standard production combination is moderate temperature (T ~ 0.7-1.0) and moderate top-p (~0.9). Pushing both knobs simultaneously into extreme territory rarely improves anything.
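
A quick way to see the interaction is to hold p fixed and vary the temperature. The sketch below uses hypothetical logits that decay linearly with rank (an assumption for illustration, not data from a real model) and reports how many tokens fall inside the p = 0.9 nucleus at each temperature.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs, p):
    """Number of tokens in the smallest set whose mass reaches p."""
    cumulative = 0.0
    for k, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative >= p:
            return k
    return len(probs)


# Hypothetical logits: a 1000-token vocabulary, decaying linearly with rank.
logits = [-0.05 * i for i in range(1000)]

sizes = {}
for t in (0.3, 0.7, 1.0, 1.5):
    sizes[t] = nucleus_size(softmax(logits, t), p=0.9)
    print(f"T={t}: nucleus size {sizes[t]}")
```

The nucleus grows with temperature: at low T the filter removes little that the peaked distribution had not already suppressed, while at high T it is doing most of the work of trimming the tail.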

When to use neither

For tasks consumed by code — structured extraction, code generation, factual QA — temperature 0 is the correct choice and top-p is irrelevant. The whole nucleus-sampling apparatus exists to support diversity in human-facing generation; it has no role when you want a single deterministic answer.

For self-consistency ensembling and similar techniques that want multiple diverse samples, you typically want top-p around 0.9-0.95 with temperature around 0.7. The goal is samples that explore different reasoning paths without veering into incoherence.

What ships in 2026

Top-p is the universal default; top-k still ships as a knob but is rarely the primary control. Newer techniques (typical sampling, min-p, eta sampling) haven’t displaced it — top-p is good enough, every API exposes it, and the marginal improvement isn’t worth a migration. Boring-but-correct is the right place for the default to be.

Go further

Why does top-p generally beat top-k?

Top-k uses a fixed number of tokens (say 40), regardless of how peaked the distribution is. When the model is highly confident, k=40 includes lots of tail garbage; when uncertain, k=40 might cut off legitimate alternatives. Top-p adapts: it includes 2 tokens when the top two are 99% of the mass, and 200 when the distribution is flat. The adaptive cutoff is a more principled match to the underlying distribution shape.

What's a good top-p value in production?

0.9 is the canonical default. 0.95 if you want slightly more diversity, 0.8 if you want tighter constraint. Below 0.7 it starts to feel close to greedy on confident-distribution tokens; above 0.99 it's essentially unconstrained sampling. The right value depends on temperature — at low T the distribution is already peaky and top-p has less work to do; at high T top-p is what stops the long tail from polluting output.

Should I use top-p AND top-k together?

Sometimes — it's a belt-and-suspenders approach. Top-k caps the maximum number of tokens considered (defends against pathological flat distributions), top-p caps the cumulative probability (defends against the long tail). Using both with sensible values (k = 50, p = 0.9) works fine and is a common default. The marginal gain over top-p alone is small.
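
A combined filter might look like the sketch below. The order of operations (top-k first, then top-p over the renormalized survivors) is an assumption of this sketch; implementations differ, so check how your inference stack composes the two. The function name top_k_top_p_filter is illustrative.

```python
def top_k_top_p_filter(probs, k, p):
    """Apply a top-k cap, then a top-p cutoff; return renormalized probs."""
    # Step 1: keep the indices of the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass_k = sum(probs[i] for i in order)

    # Step 2: within those, keep the smallest prefix reaching mass p,
    # measured on the distribution renormalized over the top-k survivors.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= p:
            break

    # Step 3: renormalize the final candidate set.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Pathologically flat distribution: top-k is the binding constraint.
flat = [1 / 1000] * 1000
# Peaked distribution: top-p is the binding constraint.
peaked = [0.99] + [0.00001] * 1000

flat_kept = top_k_top_p_filter(flat, k=50, p=0.9)
peaked_kept = top_k_top_p_filter(peaked, k=50, p=0.9)
print(len(flat_kept))    # at most 50, however flat the distribution is
print(len(peaked_kept))  # 1
```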
