Perplexity

Also known as: PPL, perplexity score

TL;DR

Perplexity is the standard intrinsic metric for evaluating language models: the exponentiated average per-token cross-entropy loss on held-out text. Lower is better.

Perplexity is the workhorse intrinsic eval for language models. Given a model and a held-out test corpus, it measures how well the model predicts the corpus or, equivalently, how surprised the model is by what actually appears.

Definition

For a sequence of tokens $x_1, \dots, x_N$ and a model $p_\theta$:

$$\mathrm{PPL}(x_{1:N}) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$

The inner sum, negated, is the total cross-entropy loss in nats; dividing by $N$ averages it per token; exponentiating turns the log-likelihood into something interpretable. The minimum possible perplexity is 1 (the model always assigns probability 1 to the correct next token). A uniform-over-vocabulary model has perplexity equal to the vocabulary size $|V|$.
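A minimal sketch of the definition, assuming you already have the natural-log probability the model assigned to each observed token. The function name `perplexity` and the sample inputs are illustrative and not taken from any particular library.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities log p(x_i | x_<i)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood per token (cross-entropy in nats).
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that is always certain and correct: log(1.0) = 0 for every token.
certain = perplexity([0.0, 0.0, 0.0])

# A uniform model over a hypothetical 50,000-entry vocabulary.
vocab_size = 50_000
uniform = perplexity([math.log(1 / vocab_size)] * 4)
```

Here `certain` comes out at the floor of 1, and `uniform` recovers the vocabulary size up to floating-point error.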

Intuition: branching factor

A useful informal interpretation: perplexity is the effective number of equally likely options the model is choosing among at each step. Perplexity of 10 means “on average, the model behaves as if it were uniformly guessing among 10 options” — even if it’s making sharper distinctions in practice. This makes the units of the metric actually mean something instead of being abstract probabilities.
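The branching-factor reading can be checked on a single predictive distribution. The sketch below exponentiates the Shannon entropy of a distribution, which matches perplexity only under the assumption that the data really are drawn from that distribution; `effective_options` is a hypothetical helper name.

```python
import math

def effective_options(probs):
    """exp(Shannon entropy in nats): the effective number of equally likely choices."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)

# Ten equally likely options behave like exactly ten options.
uniform_10 = effective_options([0.1] * 10)

# One dominant option plus nine rare ones behaves like far fewer than ten.
peaked = effective_options([0.91] + [0.01] * 9)
```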

For comparison:

A uniform random English-word model: perplexity ~50,000+
A smoothed n-gram model: perplexity ~100
GPT-2 on WikiText: perplexity ~30
Strong modern LLMs on diverse held-out text: perplexity 5–10
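Theoretical lower bound (Shannon’s entropy of English): perplexity ~1.5–2 per character (very roughly); this is a per-character figure, so it is not directly comparable to the per-word and per-token numbers above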

The same number wears three hats. Cross-entropy is the average per-token negative log-likelihood under the model: $H = -\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})$ over the $N$ tokens. In nats this is what you minimize during training. Bits per token is the same quantity but in base 2: $H_2 = H / \ln 2$. Perplexity is $e^{H} = 2^{H_2}$. So a perplexity of 8 = 3 bits per token = a cross-entropy of 2.08 nats. The exponential makes perplexity intuitive (“effective branching factor”); the log-space numbers are easier to add and average. Training graphs typically plot loss directly because additive properties matter for tracking improvement, but reported numbers are exponentiated so that the units mean something.
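The three views convert into one another with a logarithm or an exponential. A small sketch of the conversions, using the perplexity-of-8 example from the text (the helper names are illustrative):

```python
import math

def nats_to_bits(nats):
    # Changing the log base from e to 2 divides by ln 2.
    return nats / math.log(2)

def perplexity_from_nats(nats):
    return math.exp(nats)

def perplexity_from_bits(bits):
    return 2.0 ** bits

loss_nats = math.log(8)                 # cross-entropy in nats, about 2.08
bits = nats_to_bits(loss_nats)          # about 3 bits per token
ppl = perplexity_from_nats(loss_nats)   # about 8
```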

What it captures and what it misses

Perplexity correlates with model quality during pretraining — every doubling of the training corpus, or doubling of model size, drops it predictably. The scaling laws (Kaplan, Chinchilla) are formulated in terms of cross-entropy / perplexity. Watching perplexity go down on held-out data is how you know pretraining is working.

It misses a lot:

Instruction following. Perplexity on user instructions doesn’t measure whether the model actually does what they ask.
Truthfulness. A model can be confidently wrong with low perplexity if its training data was confidently wrong.
Reasoning. Multi-step inference quality is invisible to a per-token metric.
Alignment. Refusals, harmlessness, and helpfulness aren’t perplexity properties.
Calibration. A well-calibrated model and an over-confident one can have similar perplexity.

Past basic competence, perplexity is a weak signal for downstream usefulness. The frontier-LLM eval ecosystem reflects this: MMLU, HumanEval, MT-Bench, Chatbot Arena, and human preference judgments dominate; perplexity is reported as a sanity check, not a headline metric.

Tokenizer dependence

Perplexity is per-token, which means it’s not directly comparable across models with different tokenizers. A model with a 256K-vocab tokenizer that produces fewer tokens per character packs more information into each token, so it will tend to have higher per-token perplexity than a model with a 32K-vocab tokenizer over the same content, without being a worse model. Bits-per-byte (BPB) divides the total negative log-likelihood, in bits, by the byte count of the text and is the right comparison metric across tokenizer families.
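A sketch of the normalization, assuming you have the total negative log-likelihood of a text in nats (summed over tokens, not averaged). The token counts and the likelihood value are made up for illustration, and `bits_per_byte` is a hypothetical helper name.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Total negative log-likelihood (nats) converted to bits and divided by UTF-8 byte count."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("empty text")
    return total_nll_nats / (n_bytes * math.log(2))

text = "perplexity is per-token"

# Two hypothetical tokenizations of the same text, scored with the same total likelihood.
total_nll = 20.0
ppl_coarse = math.exp(total_nll / 4)   # tokenizer A splits the text into 4 tokens
ppl_fine = math.exp(total_nll / 8)     # tokenizer B splits the text into 8 tokens

# The byte count does not depend on the tokenizer, so this value is shared by both.
bpb = bits_per_byte(total_nll, text)
```

The two per-token perplexities differ even though the text received the same total likelihood, while the bits-per-byte figure is identical for both tokenizations.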

Where perplexity is still load-bearing

Perplexity still matters for pretraining loss curves, for domain adaptation evals (perplexity on legal vs general text tells you whether fine-tuning generalizes), and for detecting training data leakage (an unexpectedly low perplexity on a supposedly held-out corpus suggests the model has seen it). It’s a core diagnostic, just not a final answer.

Where perplexity earns its keep

Pretraining run health — sanity-check that loss is decreasing as expected
Domain transfer — perplexity on a target domain before / after continued pretraining
Tokenizer bake-off — bits-per-byte across tokenizers on the same corpus
Data contamination detection — sharp perplexity drops on suspect held-out sets
Quantization regression tests — fp16 vs int8 vs int4 perplexity on a fixed corpus
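Several of these uses reduce to comparing two perplexities measured on the same fixed corpus. A rough sketch of such a check for the quantization case follows; the 2% tolerance and the sample numbers are arbitrary placeholders, not measured results or recommended thresholds.

```python
def perplexity_regression(baseline_ppl, candidate_ppl, max_rel_increase=0.02):
    """Return (relative_change, passed) for a candidate measured against a baseline."""
    if baseline_ppl < 1.0 or candidate_ppl < 1.0:
        raise ValueError("perplexity is never below 1")
    rel = (candidate_ppl - baseline_ppl) / baseline_ppl
    return rel, rel <= max_rel_increase

# Made-up numbers for illustration only.
rel_int8, ok_int8 = perplexity_regression(10.00, 10.05)   # small increase, within tolerance
rel_int4, ok_int4 = perplexity_regression(10.00, 10.60)   # larger increase, flagged
```

Both measurements need to use the same tokenizer and the same corpus for the comparison to be meaningful.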

Go further

What does a perplexity of 20 actually mean?

Roughly: the model is as uncertain about the next token as if it were uniformly choosing among 20 equally likely options. A bigram model on English might score 100; a strong modern LLM scores 5–10 on diverse text. The number is interpretable as 'effective branching factor' — useful intuition, even if not literally true.

Why doesn't lower perplexity always mean a better LLM?

Perplexity rewards average next-token accuracy. It doesn't measure whether the model follows instructions, refuses bad requests, reasons through multi-step problems, or [hallucinates](/concepts/hallucination/). Two models can have similar perplexity and very different downstream usefulness. Modern LLM evals lean on benchmarks (MMLU, HumanEval, MT-Bench) and human preference, not perplexity alone.

Can you compare perplexity across different tokenizers?

Not directly. Perplexity is a per-token metric, and different tokenizers split text into different numbers of tokens for the same content. To compare across tokenizers, normalize by character or byte: 'bits per byte' is the canonical tokenizer-invariant alternative.
