Softmax

Also known as: softargmax, normalized exponential

TL;DR

Softmax maps a vector of real numbers to a probability distribution: each output is exp(xᵢ) divided by the sum of exp(xⱼ). It is the function that turns logits into next-token probabilities and attention scores into weights.

Softmax takes a vector of real numbers and returns a probability distribution over the same number of slots. The formula is simple: $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$. Every entry is non-negative, every entry is at most 1, and the entries sum to 1. It is the function that turns logits into next-token probabilities, attention scores into attention weights, and classifier outputs into class probabilities. Almost every probability inside a transformer is the output of a softmax somewhere.

The formula

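Given a vector $x = (x_1, \dots, x_n) \in \mathbb{R}^n$:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}, \qquad i = 1, \dots, n$$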

The exponential makes every output positive. The denominator normalizes them to sum to 1. Only the relative gaps between inputs matter: softmax is invariant to additive constants, since adding a constant $c$ to every entry multiplies numerator and denominator by the same factor $e^{c}$.

That invariance is the workhorse property. It enables the numerical-stability trick, makes temperature-scaling well-defined, and means logits are only ever meaningful up to a constant offset.
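As a minimal sketch of the formula and the invariance property, the snippet below transcribes the definition directly into NumPy. The function name `softmax_naive` and the sample inputs are illustrative choices, not something the text prescribes.

```python
import numpy as np

def softmax_naive(x):
    """Direct transcription of the formula: exp(x_i) / sum_j exp(x_j)."""
    e = np.exp(x)
    return e / e.sum()

x = np.array([2.0, 1.0, -1.0])
p = softmax_naive(x)

# Every entry is positive and the entries sum to 1.
print(p, p.sum())

# Adding the same constant to every input leaves the output unchanged.
shifted = softmax_naive(x + 5.0)
print(np.allclose(p, shifted))
```

This direct form is fine for small inputs like these, but it is not safe for large logits.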

Numerical stability

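Naive softmax overflows fast: $e^{89}$ already exceeds the range of fp32, whose largest finite value is about $3.4 \times 10^{38}$. The standard fix is to subtract the max before exponentiating:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}, \qquad m = \max_j x_j$$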

After subtraction, every shifted logit is in $(-\infty, 0]$, every exponential is in $(0, 1]$, and the sum is at least 1 (the max entry contributes $e^{0} = 1$), so nothing overflows and the denominator cannot vanish. The output is identical because of the additive-invariance property. Every production attention kernel, every cross-entropy loss, every transformer training loop does this, typically fused with the surrounding ops.
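The max-subtraction trick can be sketched in a few lines of NumPy. This is an unfused, illustrative version (the name `softmax_stable` and the sample logits are assumptions for the example), not how a production kernel is written.

```python
import numpy as np

def softmax_stable(x, axis=-1):
    """Softmax with the max subtracted along `axis` before exponentiating."""
    x = np.asarray(x, dtype=np.float32)
    shifted = x - np.max(x, axis=axis, keepdims=True)  # all entries <= 0
    e = np.exp(shifted)                                # all entries in [0, 1]
    return e / np.sum(e, axis=axis, keepdims=True)

logits = np.array([1000.0, 999.0, 998.0], dtype=np.float32)

# Naive form: exp(1000) overflows to inf in fp32, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()

stable = softmax_stable(logits)
print(naive)   # all nan
print(stable)  # a valid distribution
```

Because only the gaps between logits matter, `softmax_stable([1000, 999, 998])` matches `softmax_stable([2, 1, 0])`.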

Temperature

Dividing the logits by a temperature $T > 0$ before softmax rescales the distribution:

$$\mathrm{softmax}(x / T)_i = \frac{e^{x_i / T}}{\sum_j e^{x_j / T}}$$

$T \to 0$: collapses to a one-hot at the argmax (hard max).
$T = 1$: ordinary softmax.
$T \to \infty$: collapses to uniform.

This is the single most important knob in language-model decoding. Lower temperature means more deterministic, higher means more diverse. It’s also how knowledge distillation transfers “dark knowledge”: train a small model on soft targets from the teacher’s high-T softmax.
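To make the three regimes concrete, here is a small sketch that applies the same logits at a low, unit, and high temperature. The function name `softmax_with_temperature` and the specific values of T are illustrative assumptions.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed with the max subtracted for stability."""
    if T <= 0:
        raise ValueError("T must be positive; T -> 0 is a limit, not a valid input")
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.0]

cold = softmax_with_temperature(logits, T=0.05)   # close to one-hot at the argmax
plain = softmax_with_temperature(logits, T=1.0)   # ordinary softmax
hot = softmax_with_temperature(logits, T=100.0)   # close to uniform

for name, p in [("T=0.05", cold), ("T=1", plain), ("T=100", hot)]:
    print(name, np.round(p, 4))
```

The ordering of the probabilities never changes with T; only how sharply the mass concentrates on the largest logit.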

Where softmax shows up

In a modern transformer

Attention weights: softmax over $QK^\top / \sqrt{d_k}$ along the key axis
Next-token probabilities: softmax over the final logit vector
Classifier heads: softmax over class scores in fine-tuned encoders
Mixture-of-experts routing: softmax over expert scores per token
Contrastive losses: softmax over similarity scores against in-batch negatives

The recurring shape is the same: produce a real-valued score vector, softmax it, treat the result as weights or probabilities. Once you see this pattern, transformers stop being mysterious — they’re stacks of linear layers feeding into softmaxes.
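The "score, softmax, weight" pattern can be sketched for the attention case. The example assumes the usual scaled dot-product form of the scores; the shapes, the random inputs, and the helper name `softmax` are made up for illustration, and masking and multiple heads are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_queries, n_keys, d_k = 3, 5, 8          # hypothetical sizes
Q = rng.normal(size=(n_queries, d_k))
K = rng.normal(size=(n_keys, d_k))
V = rng.normal(size=(n_keys, d_k))

scores = Q @ K.T / np.sqrt(d_k)           # real-valued scores, shape (3, 5)
weights = softmax(scores, axis=-1)        # normalize over the key axis
output = weights @ V                      # weighted average of values, shape (3, 8)

print(weights.sum(axis=-1))               # each query's weights sum to 1
```

The other uses listed above follow the same shape, with only the source of the score vector changing.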

Softmax paired with cross-entropy has a notably clean gradient: the derivative of the loss with respect to the logits is just $p - y$, the predicted distribution minus the one-hot target. No exponentials or log terms survive in the final expression. This isn’t a coincidence: softmax is the inverse of the canonical link function for the categorical (multinomial) distribution in the exponential-family generalization of logistic regression. The clean gradient is what makes the combination the default classification objective in essentially every modern model. Compute the loss in cross-entropy form directly from logits, $\mathcal{L} = \log \sum_j e^{x_j} - x_t$ for target class $t$; never compute the softmax probabilities first and then feed them into a separate log. The log-sum-exp form, evaluated with the max subtracted, gives the same answer with no overflow.
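A short sketch can check both claims: the loss computed straight from logits via log-sum-exp, and the gradient being the predicted distribution minus the one-hot target. The function names and sample logits are illustrative assumptions, and the finite-difference loop is only there as an independent check on the analytic gradient.

```python
import numpy as np

def cross_entropy_from_logits(logits, target):
    """Loss = logsumexp(logits) - logits[target], with the max factored out."""
    logits = np.asarray(logits, dtype=np.float64)
    m = logits.max()
    lse = m + np.log(np.sum(np.exp(logits - m)))
    return lse - logits[target]

def grad_wrt_logits(logits, target):
    """Analytic gradient: softmax(logits) minus the one-hot target."""
    logits = np.asarray(logits, dtype=np.float64)
    e = np.exp(logits - logits.max())
    p = e / e.sum()
    y = np.zeros_like(p)
    y[target] = 1.0
    return p - y

logits = np.array([2.0, -1.0, 0.5])
target = 0
analytic = grad_wrt_logits(logits, target)

# Central finite differences on the loss, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    step = np.zeros_like(logits)
    step[i] = eps
    numeric[i] = (cross_entropy_from_logits(logits + step, target)
                  - cross_entropy_from_logits(logits - step, target)) / (2 * eps)

print(analytic)
print(numeric)
```

The gradient entries sum to zero, which reflects the additive invariance: shifting every logit by the same amount does not change the loss.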

Go further

Why exponentiate before normalizing instead of just dividing by the sum?

Two reasons. Exponentiation forces all weights positive, even if some inputs are negative; a plain sum-normalization would produce negative or undefined “probabilities” in that case. And it makes the largest input dominate smoothly, with a derivative that's well-suited to gradient descent. Softmax is also what falls out of writing the categorical distribution as an exponential family, which is why it is the natural output function for multinomial logistic regression.
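A tiny sketch of the first reason, using an input vector chosen for illustration: dividing by the plain sum breaks as soon as some scores are negative, while exponentiating first does not.

```python
import numpy as np

x = np.array([3.0, -1.0, -1.0])

# Plain sum-normalization: the entries sum to 1 but two of them are negative.
plain = x / x.sum()

# Softmax: exponentiate (with the max subtracted), then normalize.
e = np.exp(x - x.max())
soft = e / e.sum()

print(plain)
print(soft)
```

If the inputs happened to sum to zero, the plain version would divide by zero, whereas the softmax denominator is always positive.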


What does temperature actually do?

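Dividing logits by T rescales the distribution before exponentiating. T below 1 sharpens it (the argmax dominates); T above 1 flattens it toward uniform. In the limit T → 0 it becomes a hard argmax; T = 0 itself would divide by zero, so it is typically handled as a special case that just picks the argmax. T is the single most important sampling knob in language-model decoding.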


Is softmax the only way to get a probability distribution from a vector?

No. Sparsemax and entmax produce sparse outputs (some probabilities are exactly zero). Gumbel-softmax adds noise for differentiable sampling. But for almost everything in modern transformers — attention weights, next-token probabilities, classifier outputs — plain softmax is the default and stays the default.

