Cross-Entropy Loss

Also known as: log loss, negative log-likelihood, NLL, categorical cross-entropy

TL;DR

Cross-entropy loss is H(p, q) = -Σ_i p(i) log q(i): the average number of nats it costs to encode samples from the true distribution p using a code optimized for the model's predicted distribution q.

Cross-entropy loss measures how well a predicted probability distribution q matches a target distribution p. The formula is H(p, q) = -Σ_i p(i) log q(i), summed over the classes (or vocabulary tokens). Information-theoretically, it is the average number of nats per sample needed to encode draws from p using a code optimized for q. In ML it is the default loss for classification and for the next-token prediction objective that trains every modern LLM.
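As a minimal sketch of the definition, the snippet below evaluates the sum directly in plain Python. The function name `cross_entropy` and the three-class distributions are illustrative assumptions, not taken from any library or dataset.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_i p[i] * log(q[i]), in nats."""
    # Terms with p[i] == 0 contribute nothing, so they are skipped.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative target distribution over three classes (made-up numbers).
p = [0.7, 0.2, 0.1]
good_q = [0.6, 0.3, 0.1]   # prediction close to p
bad_q = [0.1, 0.2, 0.7]    # prediction with most mass on the wrong class

loss_good = cross_entropy(p, good_q)
loss_bad = cross_entropy(p, bad_q)
loss_self = cross_entropy(p, p)  # equals the entropy of p, the lowest achievable value
```

The loss is smallest when q equals p and grows as q moves mass away from where p puts it.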

[Figure: Cross-entropy loss. One sample, one log: the loss reaches into q at the true index. Prompt: "the cat sat on the ___". Model q = softmax(logits): mat 0.72 (true token, p = 1), sofa 0.09, floor 0.06, rug 0.05, chair 0.04, roof 0.04. L = -log q_true = -log(0.72) = 0.33. Sharp mass on the true token gives a small loss.]

The shape that matters

In practice, the target is almost always a one-hot vector — a single correct class index. The sum collapses to a single term:

loss = -log q(y_correct)

This is exactly the negative log-likelihood of the correct label under the model. “Cross-entropy” and “negative log-likelihood” are two names for the same scalar; the first is the information-theoretic framing, the second the statistical one.

The model’s q is produced by applying softmax to a vector of logits. For a vocabulary of size V, q is a V-dimensional vector summing to 1, and the loss reaches down into it for the index of the correct token.
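One way to sketch the path from logits to the one-hot loss is shown below. It assumes a tiny hypothetical 4-token vocabulary, and the helper names `log_softmax` and `cross_entropy_from_logits` are illustrative. The log-sum-exp shift is a common stability trick, not something the text above prescribes.

```python
import math

def log_softmax(logits):
    """Return log q for q = softmax(logits), using the log-sum-exp trick."""
    m = max(logits)  # subtracting the max keeps exp() from overflowing
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy_from_logits(logits, target_index):
    """One-hot cross-entropy: -log q[target_index]."""
    return -log_softmax(logits)[target_index]

logits = [2.0, 0.5, -1.0, 0.1]                     # hypothetical logits
q = [math.exp(lq) for lq in log_softmax(logits)]   # probabilities summing to 1
loss = cross_entropy_from_logits(logits, target_index=0)
```

Working in log space means the loss never takes the log of a probability that has underflowed to zero.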

Why every LLM uses it

A language model trained with the next-token objective sees a sequence of tokens and is asked to predict the next one at every position. The loss at each position is the cross-entropy between the one-hot true next token and the model’s softmax over the vocabulary. Averaged over all positions in a batch, this is the single scalar that gradient descent minimizes during training.

Perplexity, the standard reporting metric for language models, is just exp(loss), with the loss measured in nats. A perplexity of 20 means the model is, on average, as uncertain about the next token as if it were choosing uniformly among 20 options.
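A small sketch of the perplexity calculation, assuming the usual convention that perplexity is the exponential of the mean per-token loss in nats. The function name `perplexity` and the token probabilities are made up for illustration.

```python
import math

def perplexity(per_token_losses):
    """exp of the mean per-token cross-entropy (losses in nats)."""
    mean_loss = sum(per_token_losses) / len(per_token_losses)
    return math.exp(mean_loss)

# A model that is uniform over 20 options pays log(20) nats at every position.
uniform_losses = [math.log(20)] * 5
ppl_uniform = perplexity(uniform_losses)

# Hypothetical probabilities a model assigned to the true next tokens.
true_token_probs = [0.72, 0.10, 0.35, 0.05]
ppl_model = perplexity([-math.log(prob) for prob in true_token_probs])
```

Under this definition, perplexity is the reciprocal of the geometric mean of the probabilities assigned to the true tokens.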

Relationship to KL divergence

Cross-entropy decomposes:

H(p, q) = H(p) + KL(p || q)

H(p) is the entropy of the data distribution: fixed and uncontrollable. The only piece you can move with gradients is KL(p || q). So minimizing cross-entropy is exactly minimizing the KL divergence KL(p || q) between the truth and your predictions, up to an additive constant. That equivalence is the reason cross-entropy is a sensible learning signal: it makes q track p.
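The decomposition can be checked numerically. The sketch below uses made-up three-class distributions and illustrative helper names (`entropy`, `kl_divergence`, `cross_entropy`).

```python
import math

def entropy(p):
    """H(p) = -sum_i p[i] * log(p[i]), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p[i] * log(p[i] / q[i])."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p[i] * log(q[i])."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # illustrative target distribution
q = [0.5, 0.3, 0.2]   # illustrative prediction
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)

# With a one-hot target, H(p) is zero, so cross-entropy and KL coincide.
one_hot = [0.0, 1.0, 0.0]
```

For one-hot labels the entropy term vanishes, so the loss value itself is the KL divergence.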

For a softmax output q and one-hot target p, the gradient of the loss with respect to the logits z works out to:

dL/dz = q - p

Just the difference between the predicted probabilities and the target. No softmax derivative explicitly appears: it cancels with the log inside the cross-entropy. This is why classification heads in deep learning almost always pair softmax with cross-entropy: the gradient computation is one subtraction per logit, numerically stable and fast to evaluate.

[Figure: Cross-entropy gradient. The gradient with respect to the logits is just q - p. Predicted q = softmax(z) over z₁..z₆: 0.205, 0.046, 0.009, 0.616, 0.023, 0.102. Target p is one-hot at z₄. L = -log(0.616) = 0.485. Gradient ∂L/∂z = q - p: +0.205, +0.046, +0.009, -0.384, +0.023, +0.102. One subtraction per logit; no softmax derivative, the chain rule cancels.]
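The q - p result can be sanity-checked against finite differences. The sketch below is illustrative only: the logits are made-up values, and the helper names (`softmax`, `loss`, `analytic_grad`, `numeric_grad`) are not from any library.

```python
import math

def softmax(z):
    m = max(z)  # shift by the max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """One-hot cross-entropy of softmax(z) against class index y."""
    return -math.log(softmax(z)[y])

def analytic_grad(z, y):
    """dL/dz = q - p, where p is one-hot at index y."""
    q = softmax(z)
    return [qi - (1.0 if i == y else 0.0) for i, qi in enumerate(q)]

def numeric_grad(z, y, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((loss(up, y) - loss(down, y)) / (2 * eps))
    return grad

z = [1.0, -0.5, -2.0, 2.1, -1.2, 0.3]  # hypothetical logits
y = 3                                  # index of the correct class
g_analytic = analytic_grad(z, y)
g_numeric = numeric_grad(z, y)
```

The entry at the true index is negative (raise that logit) and every other entry is positive (lower those logits). The entries sum to zero because q and p each sum to 1.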

If you tried to compose softmax with mean-squared error instead, you would get a tangled product of softmax derivatives that vanishes for confident-but-wrong predictions, making the model learn far more slowly. The pairing is not arbitrary: cross-entropy is the loss that gives softmax a clean, non-saturating gradient.
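To make the saturation concrete, the sketch below compares the two logit gradients on a confidently wrong prediction. It assumes the squared-error loss is 0.5 * Σ_i (q_i - p_i)^2; the logits and the function names `ce_grad` and `mse_grad` are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad(z, y):
    """Gradient of -log q[y] with respect to the logits: q - p."""
    q = softmax(z)
    return [qi - (1.0 if i == y else 0.0) for i, qi in enumerate(q)]

def mse_grad(z, y):
    """Gradient of 0.5 * sum_i (q_i - p_i)^2 with respect to the logits.

    Chain rule through the softmax Jacobian dq_i/dz_k = q_i * (delta_ik - q_k)
    gives dL/dz_k = q_k * (err_k - sum_i err_i * q_i).
    """
    q = softmax(z)
    err = [qi - (1.0 if i == y else 0.0) for i, qi in enumerate(q)]
    s = sum(e * qi for e, qi in zip(err, q))
    return [qk * (ek - s) for qk, ek in zip(q, err)]

# Confident but wrong: almost all mass on class 0 while the true class is 1.
z = [10.0, 0.0, 0.0]
y = 1
ce_max = max(abs(g) for g in ce_grad(z, y))
mse_max = max(abs(g) for g in mse_grad(z, y))
```

In this example the cross-entropy gradient stays near 1 in magnitude, while every squared-error gradient entry is scaled by a near-zero softmax factor and so is orders of magnitude smaller.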

When cross-entropy is the wrong choice

Common alternatives
  • Regression targets — use mean-squared error or Huber loss; cross-entropy assumes a categorical output.
  • Ranking — pairwise losses (BPR, RankNet) or listwise losses optimize ordering, not absolute probabilities.
  • Heavy class imbalance — focal loss down-weights easy, well-classified examples; weighted cross-entropy rescales the loss per class.
  • Embedding training — contrastive losses replace cross-entropy when you want representations to live in a similarity space.
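For the class-imbalance alternatives in the list above, a rough sketch of how the two variants modify the one-hot loss might look like this. The focal form -(1 - q(y))^γ log q(y) with γ = 2 is the commonly cited one; the function names, the weight, and the probabilities are illustrative assumptions.

```python
import math

def weighted_cross_entropy(q_true, class_weight):
    """-w_y * log q(y): scales the loss by a per-class weight."""
    return -class_weight * math.log(q_true)

def focal_loss(q_true, gamma=2.0):
    """-(1 - q(y))^gamma * log q(y): shrinks the loss on well-classified examples."""
    return -((1.0 - q_true) ** gamma) * math.log(q_true)

easy, hard = 0.95, 0.10   # hypothetical probability assigned to the correct class

# Fraction of the plain cross-entropy that survives the focal factor.
easy_ratio = focal_loss(easy) / -math.log(easy)
hard_ratio = focal_loss(hard) / -math.log(hard)
```

With γ = 0 the focal loss reduces to plain cross-entropy, which makes γ a dial for how strongly easy examples are discounted.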

For categorical targets with a fixed class set, though, cross-entropy is the default, and the reason is that minimizing it has the right asymptotic properties: the minimizer is the maximum-likelihood estimate, and with enough data and model capacity the resulting q matches the true conditional distribution.

Go further

Is cross-entropy the same as negative log-likelihood?

When the target is a one-hot label (a single correct class), yes — the sum collapses to -log q(y), which is exactly the negative log-likelihood of the correct class. The two names describe the same loss from two different angles: information-theoretic (cross-entropy) versus statistical (NLL).

How does it relate to KL divergence?

Cross-entropy decomposes as H(p, q) = H(p) + KL(p || q). Since H(p) is fixed by the data, minimizing cross-entropy over q is equivalent to minimizing the KL divergence KL(p || q). That is why cross-entropy works as a learning signal at all.

Why does cross-entropy pair with softmax?

Softmax turns raw logits into a probability distribution q. Cross-entropy then measures how far q is from the target. The composition has a clean gradient — q − p — which is what makes the pair the dominant choice for classification heads.
