Also known as: log loss, negative log-likelihood, NLL, categorical cross-entropy
TL;DR
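Cross-entropy loss is H(p, q) = -Σ_x p(x) log q(x): the average number of nats it costs to encode samples from the true distribution p using a code built for the model's predicted distribution q. (The extra cost over the best possible code is the KL divergence, covered below.)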
Cross-entropy loss measures how well a predicted probability distribution q matches a target distribution p. The formula is H(p, q) = -Σ_x p(x) log q(x), summed over the classes (or vocabulary tokens). Information-theoretically, it is the average number of nats per sample needed to encode draws from p using a code optimized for q. In ML it is the default loss for classification and for the next-token prediction objective that trains every modern large language model.
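The definition can be sketched in a few lines. This is a minimal illustration, assuming natural logarithms (nats) and distributions given as plain Python lists; the function name cross_entropy and the example values are hypothetical, not taken from any library.

```python
import math

def cross_entropy(p, q):
    """Return H(p, q) = -sum_x p(x) * log q(x), in nats.

    p and q are sequences of probabilities over the same classes.
    Terms with p(x) == 0 contribute nothing, so they are skipped.
    """
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [1.0, 0.0, 0.0]          # one-hot target: class 0 is correct
q_good = [0.9, 0.05, 0.05]   # puts most mass on the correct class
q_bad = [0.1, 0.45, 0.45]    # puts most mass on the wrong classes

print(cross_entropy(p, q_good))  # small loss
print(cross_entropy(p, q_bad))   # larger loss
```

The better q matches p, the lower the value; with a one-hot p only the probability assigned to the correct class matters.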
The shape that matters
In practice, the target is almost always a one-hot vector — a single correct class index. The sum collapses to a single term:
loss = -log q(y_correct)
This is exactly the negative log-likelihood of the correct label under the model. “Cross-entropy” and “negative log-likelihood” are two names for the same scalar; the first is the information-theoretic framing, the second the statistical one.
The model’s q is produced by applying softmax to a vector of logits z. For a vocabulary of size V, q is a V-dimensional vector summing to 1, and the loss reaches down into it for the index of the correct token.
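A small sketch of that pipeline, from logits to loss, might look like the following. It assumes the common log-sum-exp trick for stability; the helper names log_softmax and one_hot_cross_entropy and the example logits are illustrative choices, not a reference implementation.

```python
import math

def log_softmax(z):
    """Numerically stable log-softmax: shift by the max before exponentiating."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(zi - m) for zi in z))
    return [zi - log_sum for zi in z]

def one_hot_cross_entropy(z, y):
    """Loss for logits z and correct class index y: -log q(y)."""
    return -log_softmax(z)[y]

logits = [2.0, 0.5, -1.0]   # hypothetical logits for a 3-class problem
q = [math.exp(v) for v in log_softmax(logits)]

print(sum(q))                             # sums to 1 up to rounding
print(one_hot_cross_entropy(logits, 0))   # equals -log q[0]
```

Working in log space avoids overflow for large logits, which is why the loss is usually computed from logits directly rather than from already-normalized probabilities.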
Why every LLM uses it
A transformer trained with the next-token objective sees a sequence of tokens and is asked to predict the next one at every position. The loss at each position is the cross-entropy between the one-hot true next token and the model’s softmax over the vocabulary. Averaged over a batch, this is the single scalar that gradient descent minimizes during pretraining.
Perplexity, the standard reporting metric for language models, is just exp(H(p, q)): the exponential of the average per-token cross-entropy in nats. A perplexity of 20 means the model is, on average, as uncertain about the next token as if it were choosing uniformly among 20 options.
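The "20 options" reading can be checked directly. The sketch below assumes per-token losses are already available in nats; the function name perplexity and the toy loss list are hypothetical.

```python
import math

def perplexity(token_losses):
    """exp of the mean per-token cross-entropy (losses in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))

# A model that assigns probability 1/20 to every correct token
uniform_losses = [-math.log(1 / 20)] * 5
print(perplexity(uniform_losses))   # approximately 20
```

If the losses were measured in bits instead of nats, the base would be 2 rather than e.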
Relationship to KL divergence
Cross-entropy decomposes:
H(p, q) = H(p) + KL(p || q)
H(p) is the entropy of the data distribution: fixed and uncontrollable. The only piece you can move with gradients is KL(p || q). So minimizing cross-entropy is exactly minimizing the KL divergence between the truth and your predictions, up to an additive constant. That equivalence is the reason cross-entropy is a sensible learning signal: it makes q track p.
For a softmax output q and one-hot target p, the gradient of the loss with respect to the logits z works out to:
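The identity H(p, q) = H(p) + KL(p || q) is easy to confirm numerically. The following is a minimal check, assuming natural logarithms; the two distributions are made-up values chosen only for illustration.

```python
import math

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]   # hypothetical target distribution
q = [0.5, 0.3, 0.2]   # hypothetical model prediction

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)   # the two agree up to floating-point rounding
```

Setting q equal to p drives the KL term to zero, leaving only the entropy floor H(p).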
dL/dz = q - p
The gradient is just the difference between the predicted probabilities and the target. No softmax derivative appears explicitly: it cancels against the log inside the cross-entropy. This is why classification heads in deep learning almost always pair softmax with cross-entropy: the gradient is one subtraction per logit, cheap to evaluate and numerically stable.
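One way to convince yourself of the q - p result is to compare it against a finite-difference estimate. This sketch assumes a one-hot target given as a class index y; the logits and helper names are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Cross-entropy of softmax(z) against the one-hot target at index y."""
    return -math.log(softmax(z)[y])

def analytic_grad(z, y):
    """Closed-form gradient: q - p, with p one-hot at index y."""
    q = softmax(z)
    return [qi - (1.0 if i == y else 0.0) for i, qi in enumerate(q)]

def numeric_grad(z, y, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((loss(up, y) - loss(down, y)) / (2 * eps))
    return grad

z = [1.5, -0.3, 0.8]   # hypothetical logits
y = 2                  # hypothetical correct class

print(analytic_grad(z, y))
print(numeric_grad(z, y))
```

The components of q - p sum to zero, since both q and p sum to 1.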
If you tried to compose softmax with mean-squared error instead, you would get a tangled product of softmax derivatives that vanishes for confident-but-wrong predictions, making the model learn far more slowly. The pairing is not arbitrary: cross-entropy is the natural matching loss for softmax, the one for which the gradient with respect to the logits is clean and non-saturating.
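The saturation effect shows up in a toy case. The sketch below compares the two gradients for a confident-but-wrong prediction; it assumes the squared-error loss is 0.5 * sum((q - p)^2), and the logits are invented for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad(z, y):
    """Gradient of -log q(y) with respect to the logits: q - p."""
    q = softmax(z)
    return [qi - (1.0 if i == y else 0.0) for i, qi in enumerate(q)]

def mse_grad(z, y):
    """Gradient of 0.5 * sum((q - p)^2) with respect to the logits.

    Chain rule through softmax: dL/dz_j = q_j * (d_j - sum_i d_i * q_i),
    where d = q - p.
    """
    q = softmax(z)
    d = [qi - (1.0 if i == y else 0.0) for i, qi in enumerate(q)]
    s = sum(di * qi for di, qi in zip(d, q))
    return [qj * (dj - s) for qj, dj in zip(q, d)]

z = [10.0, 0.0, 0.0]   # model is confident in class 0
y = 1                  # but class 1 is correct

print(max(abs(g) for g in ce_grad(z, y)))    # close to 1
print(max(abs(g) for g in mse_grad(z, y)))   # orders of magnitude smaller
```

With cross-entropy the wrong prediction produces a gradient near 1 in magnitude, while the squared-error gradient is scaled down by the tiny softmax probabilities and nearly vanishes.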
When cross-entropy is the wrong choice
Common alternatives
Regression targets — use mean-squared error or Huber loss; cross-entropy assumes a categorical output.
Ranking — pairwise losses (BPR, RankNet) or listwise losses optimize ordering, not absolute probabilities.
Heavy class imbalance — focal loss down-weights easy, already well-classified examples, and weighted cross-entropy rescales each class's contribution to the loss.
Embedding training — contrastive losses like InfoNCE replace cross-entropy when you want representations to live in a similarity space.
For categorical targets with a fixed class set, though, cross-entropy is the default, and the reason is that minimizing it has the right asymptotic properties: the minimizer is the maximum-likelihood estimate, and with enough data and a sufficiently expressive model the resulting q matches the true conditional distribution.
Go further
Is cross-entropy the same as negative log-likelihood?
When the target is a one-hot label (a single correct class), yes — the sum collapses to -log q(y), which is exactly the negative log-likelihood of the correct class. The two names describe the same loss from two different angles: information-theoretic (cross-entropy) versus statistical (NLL).
Cross-entropy decomposes as H(p, q) = H(p) + KL(p || q). Since H(p) is fixed by the data, minimizing cross-entropy over q is equivalent to minimizing KL(p || q). That is why cross-entropy works as a learning signal at all.
Softmax turns raw logits into a probability distribution q. Cross-entropy then measures how far q is from the target. The composition has a clean gradient — q − p — which is what makes the pair the dominant choice for classification heads.