Cross-Entropy Loss

Also known as: log loss, negative log-likelihood, NLL, categorical cross-entropy

TL;DR

Cross-entropy loss is H(p, q) = -Σ_i p(i) log q(i): the average number of nats it costs to encode samples from the true distribution p using a code built for the model's predicted distribution q.

Cross-entropy loss measures how well a predicted probability distribution q matches a target distribution p. The formula is H(p, q) = -Σ_i p(i) log q(i), summed over the classes (or vocabulary tokens). Information-theoretically, it is the average number of nats per sample needed to encode draws from p using a code optimized for q. In ML it is the default loss for classification and for the next-token prediction objective that trains every modern large language model.
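A minimal sketch of the formula in plain Python, assuming both distributions are given as lists of probabilities over the same classes. The function name and the example values are illustrative, not taken from any library.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p(i) * log q(i), in nats.

    Terms with p(i) == 0 are skipped because they contribute nothing.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # target distribution (made-up values)
q = [0.5, 0.3, 0.2]  # predicted distribution (made-up values)

loss = cross_entropy(p, q)          # cost of coding p with a code built for q
self_entropy = cross_entropy(p, p)  # H(p, p) = H(p), the smallest achievable value
```

The loss is never below H(p); the two are equal only when q matches p on every class that p gives weight to.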

The shape that matters

In practice, the target is almost always a one-hot vector — a single correct class index. The sum collapses to a single term:

loss = -log q(y_correct)

This is exactly the negative log-likelihood of the correct label under the model. “Cross-entropy” and “negative log-likelihood” are two names for the same scalar; the first is the information-theoretic framing, the second the statistical one.

The model's q is produced by applying softmax to a vector of logits z. For a vocabulary of size V, q is a V-dimensional vector summing to 1, and the loss reads off the entry at the index of the correct token.
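The path from logits to loss can be sketched as follows. This version uses the log-sum-exp trick (subtracting the largest logit before exponentiating) so that large logits do not overflow; that trick is a standard implementation detail rather than something stated above, and the logit values are made up.

```python
import math

def log_softmax(z):
    """Return log q for q = softmax(z), computed stably via log-sum-exp."""
    m = max(z)
    lse = m + math.log(sum(math.exp(zi - m) for zi in z))
    return [zi - lse for zi in z]

def nll_from_logits(z, y):
    """Cross-entropy against a one-hot target at index y: -log q(y)."""
    return -log_softmax(z)[y]

logits = [2.0, 0.5, -1.0, 0.0]  # illustrative logits for a 4-class problem
target = 0                      # index of the correct class

q = [math.exp(v) for v in log_softmax(logits)]
loss = nll_from_logits(logits, target)
```

Working in log space avoids ever forming a probability that underflows to zero and then taking its log.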

Why every LLM uses it

A transformer trained with the next-token objective sees a sequence of tokens and is asked to predict the next one at every position. The loss at each position is the cross-entropy between the one-hot true next token and the model’s softmax over the vocabulary. Averaged over a batch, this is the single scalar that gradient descent minimizes during pretraining.
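As a rough illustration of the shift between inputs and targets, the sketch below averages the per-position loss over one sequence. The per-position logits stand in for a model's output and are hypothetical; a real training loop would also average across the batch.

```python
import math

def nll(logits, target):
    """-log softmax(logits)[target], computed stably via log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[target]

def next_token_loss(position_logits, tokens):
    """Mean cross-entropy where position t is scored against token t + 1."""
    targets = tokens[1:]  # the token that follows each position
    if len(position_logits) != len(targets):
        raise ValueError("need one logit vector per predicted token")
    losses = [nll(z, y) for z, y in zip(position_logits, targets)]
    return sum(losses) / len(losses)

tokens = [2, 0, 1, 3]                           # toy token ids, vocabulary of size 4
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3     # a "model" that knows nothing
loss = next_token_loss(uniform_logits, tokens)  # equals log(4) for uniform predictions
```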

Perplexity, the standard reporting metric for language models, is just exp(loss): the exponential of the average per-token cross-entropy in nats. A perplexity of 20 means the model is, on average, as uncertain about the next token as if it were choosing uniformly among 20 options.
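The "uniform among 20 options" reading can be checked directly. The sketch assumes the per-token losses are already in nats; the helper name is illustrative.

```python
import math

def perplexity(token_losses):
    """exp of the mean per-token cross-entropy (losses in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))

# A model that assigns probability 1/20 to every correct token.
uniform_20 = [-math.log(1 / 20)] * 5
ppl = perplexity(uniform_20)
```

If the losses were measured in bits instead, the base would be 2 rather than e.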

Relationship to KL divergence

Cross-entropy decomposes:

H(p, q) = H(p) + KL(p || q)

H(p) is the entropy of the data distribution: fixed and outside the model's control. The only piece you can move with gradients is KL(p || q). So minimizing cross-entropy is exactly minimizing the KL divergence KL(p || q) between the truth and your predictions, up to an additive constant. That equivalence is the reason cross-entropy is a sensible learning signal: it makes q track p.
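The decomposition can be verified numerically. This is a small sketch with made-up distributions, assuming q is strictly positive wherever p is.

```python
import math

def entropy(p):
    """H(p) = -sum_i p(i) log p(i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p(i) log(p(i) / q(i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p(i) log q(i)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.3, 0.1]  # illustrative target distribution
q = [0.2, 0.5, 0.3]  # illustrative prediction

h = entropy(p)
kl = kl_divergence(p, q)
ce = cross_entropy(p, q)  # matches h + kl up to floating-point rounding
```

For a one-hot target, H(p) is zero, so the cross-entropy and the KL divergence coincide.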

For a softmax output q = softmax(z) and one-hot target p, the gradient of the loss L with respect to the logits z works out to:

dL/dz = q - p

Just the difference between the predicted probabilities and the target. No softmax derivative appears explicitly: it cancels with the log inside the cross-entropy. This is why softmax + cross-entropy is the dominant pairing for classification heads in deep learning: the gradient computation is one subtraction per logit, numerically stable and fast to evaluate.
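One way to convince yourself of the q - p result is to compare it against a finite-difference estimate. The sketch below does that for a single made-up logit vector; the step size eps is an arbitrary choice.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Cross-entropy of softmax(z) against a one-hot target at index y."""
    return -math.log(softmax(z)[y])

def analytic_grad(z, y):
    """dL/dz = q - p, with p one-hot at index y."""
    q = softmax(z)
    return [qi - (1.0 if i == y else 0.0) for i, qi in enumerate(q)]

def numeric_grad(z, y, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up, y) - loss(down, y)) / (2 * eps))
    return grad

logits = [1.5, -0.3, 0.8]  # illustrative values
y = 2
g_analytic = analytic_grad(logits, y)
g_numeric = numeric_grad(logits, y)
```

The gradient components sum to zero, since both q and p sum to 1.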

If you tried to compose softmax with mean-squared error instead, you would get a tangled product of softmax derivatives that vanishes for confident-but-wrong predictions, making the model learn far more slowly. The pairing is not arbitrary: cross-entropy is the natural matching loss for softmax, the one whose log cancels the exponential and leaves a clean, non-saturating gradient.
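The saturation effect can be seen on a single confident-but-wrong example. This sketch assumes the squared-error loss is 0.5 * sum_i (q(i) - p(i))^2 (the constant factor does not change the conclusion) and uses made-up logits.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad(z, y):
    """Gradient of cross-entropy w.r.t. the logits: q - p."""
    q = softmax(z)
    return [qi - (1.0 if i == y else 0.0) for i, qi in enumerate(q)]

def mse_grad(z, y):
    """Gradient of 0.5 * sum_i (q_i - p_i)^2 w.r.t. the logits.

    Chain rule through the softmax Jacobian dq_i/dz_j = q_i * (delta_ij - q_j).
    """
    q = softmax(z)
    n = len(z)
    p = [1.0 if i == y else 0.0 for i in range(n)]
    return [
        sum((q[i] - p[i]) * q[i] * ((1.0 if i == j else 0.0) - q[j]) for i in range(n))
        for j in range(n)
    ]

logits = [10.0, 0.0, 0.0]  # the model is very sure of class 0
y = 1                      # but the correct class is 1

g_ce = ce_grad(logits, y)    # about -1 on the correct logit: a strong signal
g_mse = mse_grad(logits, y)  # about -1e-4 on the correct logit: almost no signal
```

The extra q_i factors from the softmax Jacobian are what shrink the squared-error gradient when the correct class has near-zero probability.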

When cross-entropy is the wrong choice

Common alternatives

Regression targets — use mean-squared error or Huber loss; cross-entropy assumes a categorical output.
Ranking — pairwise losses (BPR, RankNet) or listwise losses optimize ordering, not absolute probabilities.
Heavy class imbalance — focal loss or weighted cross-entropy down-weight easy negatives.
Embedding training — contrastive losses like InfoNCE replace cross-entropy when you want representations to live in a similarity space.

For categorical targets with a fixed class set, though, cross-entropy is the default, and the reason it is the default is that minimizing it has the right asymptotic properties: the minimizer is the maximum-likelihood estimate, and given enough data and a sufficiently expressive model the resulting q matches the true conditional distribution.

Go further

Is cross-entropy the same as negative log-likelihood?

When the target is a one-hot label (a single correct class), yes — the sum collapses to -log q(y), which is exactly the negative log-likelihood of the correct class. The two names describe the same loss from two different angles: information-theoretic (cross-entropy) versus statistical (NLL).

How does it relate to KL divergence?

Cross-entropy H(p, q) = H(p) + KL(p || q). Since H(p) is fixed by the data, minimizing cross-entropy is equivalent to minimizing the KL divergence from p to q. That is why cross-entropy works as a learning signal at all.

Why does cross-entropy pair with softmax?

Softmax turns raw logits into a probability distribution q. Cross-entropy then measures how far q is from the target. The composition has a clean gradient — q − p — which is what makes the pair the dominant choice for classification heads.
