Epoch

Also known as: training epoch, pass over the data

TL;DR

An epoch is one full pass over the training set. Classic deep learning trains for tens to hundreds of epochs; modern LLM pretraining usually runs for less than one epoch, so most tokens are seen at most once. Fine-tuning typically runs 1-10 epochs.

An epoch is a single full pass over the training dataset: the model sees every training example exactly once. If your dataset has N examples and you train with batch size B, one epoch is N / B steps. Training for E epochs means showing the model the data E times.

[Figure: Epoch, one full pass over the training set. A dataset of 32 examples is split into batches B1 to B8 (N examples, B per batch, N / B steps, one pass), next to a plot of loss vs. gradient steps over 4 epochs. 1 epoch = N / B = 32 / 4 = 8 steps.]
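The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function names and the `drop_last` flag are assumptions made for this example. The flag covers the case where N is not a multiple of B, where the final partial batch is either kept (round up) or discarded (round down).

```python
import math

def steps_per_epoch(num_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Gradient steps needed for one full pass over the data."""
    if drop_last:
        # Discard the final partial batch, if any.
        return num_examples // batch_size
    # Keep the final partial batch as its own step.
    return math.ceil(num_examples / batch_size)

def total_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Gradient steps for num_epochs full passes."""
    return steps_per_epoch(num_examples, batch_size) * num_epochs

print(steps_per_epoch(32, 4))                  # 8
print(total_steps(32, 4, 4))                   # 32
print(steps_per_epoch(33, 4))                  # 9
print(steps_per_epoch(33, 4, drop_last=True))  # 8
```

The first two calls use the numbers from the figure: 32 examples at 4 per batch give 8 steps per epoch, and 4 epochs give 32 steps.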

Where the term came from

Epochs were the natural unit of training in pre-2015 deep learning, when datasets were small (CIFAR-10: 60K images, ImageNet: 1.2M images) and models were trained from scratch for many passes. A typical ImageNet ResNet recipe ran 90-300 epochs. Modern AI broke the convention: LLM corpora are large enough that the model never re-sees most tokens. The unit drifted from epochs to tokens for LLM pretraining and steps for everything else. When a 2026 paper says “trained on 15 trillion tokens,” that is the whole training description; the epoch count isn’t mentioned.

The modern landscape

Regime | Typical epochs | Why
LLM pretraining | under 1 (often 0.3-0.7) | Token-limited; corpora too large to re-read
LLM continued pretraining | 1-2 | Domain adaptation on smaller corpora
Supervised fine-tuning (SFT) | 1-3 | Small datasets, visible overfitting beyond ~3
Preference optimization (DPO, RLHF) | 1-2 | Tiny datasets, very prone to overfit
LoRA fine-tuning | 3-10 | Fewer trainable params buys overfitting headroom
Embedding model training | 1-10 | Depends heavily on contrastive recipe
Image classification | 50-300 | Small datasets relative to model capacity
ImageNet from scratch | 90-300 | Convention; scales with regularization recipe

The trend over the last decade is fewer epochs as datasets grow. The Chinchilla scaling laws showed compute is better spent on more unique tokens than re-reading the same ones, which set the LLM pretraining default. In fine-tuning, repeated exposure to a small SFT dataset is the most common path to overfitting.

Repeating data is sometimes still worthwhile. Once you’ve exhausted the high-quality unique tokens and still have compute budget, the question becomes: train a smaller model on more unique tokens, or train a larger model that re-sees tokens?

Empirical work in the post-Chinchilla era shows that repeating data up to ~4 epochs is roughly as effective as adding fresh tokens of comparable quality. Beyond that, repetitions add diminishing value and eventually start to hurt — the model overfits to specific tokens and loses generalization.

So in practice: high-quality tokens are repeated up to a few epochs in modern LLM training, while lower-quality tokens are seen exactly once or filtered out. The aggregate is “less than one epoch over the union, but multiple epochs over the curated subset.”
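A small sketch of how "less than one epoch over the union, but multiple epochs over the curated subset" works out numerically. All numbers are hypothetical, chosen only to make the arithmetic easy to follow; the source names (`curated`, `web`) and the function name are assumptions for this example.

```python
def epochs_seen(tokens_drawn: int, source_tokens: int) -> float:
    """Passes over a source = tokens sampled from it / tokens it contains."""
    return tokens_drawn / source_tokens

# Hypothetical corpus: unique tokens available in each source.
source_sizes = {"curated": 1_000_000_000_000, "web": 20_000_000_000_000}

# Hypothetical training mix: tokens actually sampled from each source.
tokens_drawn = {"curated": 3_000_000_000_000, "web": 7_000_000_000_000}

per_source = {
    name: epochs_seen(tokens_drawn[name], source_sizes[name])
    for name in source_sizes
}
# Epochs over the union: total tokens trained on / total unique tokens.
overall = epochs_seen(sum(tokens_drawn.values()), sum(source_sizes.values()))

for name, epochs in per_source.items():
    print(f"{name}: {epochs:.2f} epochs")
print(f"union: {overall:.2f} epochs")
```

With these made-up numbers the curated subset is read three times, the web subset about a third of a time, and the run as a whole covers just under half an epoch of the combined corpus.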

Watching the validation loss

The classical use of an epoch was as the unit on which you check validation loss to detect overfitting: train E epochs, evaluate after each, stop when it starts climbing. Early stopping at the best validation epoch was the default regularization tool. For modern LLMs the equivalent is evaluation steps — checkpoint every K thousand steps, calibrated so you get ~10-50 evaluation points across the run.
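Both habits can be sketched briefly in Python. This is an illustration under stated assumptions rather than a reference implementation: the `patience` parameter (how many non-improving epochs to tolerate before stopping) is an addition not described above, and the function names and loss values are made up for the example.

```python
def early_stop(val_losses, patience=2):
    """Scan per-epoch validation losses and return (best_epoch, stop_epoch).

    Epochs are 1-indexed. Training stops once `patience` epochs have
    passed without a new best validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Never triggered: ran all epochs.
    return best_epoch, len(val_losses)

def eval_interval(total_steps, num_eval_points=20):
    """Steps between evaluations so a run yields about num_eval_points checks."""
    return max(1, total_steps // num_eval_points)

# Hypothetical validation curve: improves for 3 epochs, then climbs.
print(early_stop([2.0, 1.5, 1.2, 1.25, 1.3, 1.4]))  # (3, 5)
print(eval_interval(100_000, num_eval_points=20))   # 5000
```

In the first call the best epoch is 3 and training stops at epoch 5, after two epochs without improvement; the checkpoint from epoch 3 is the one to keep.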

Go further

Why is LLM pretraining usually less than one epoch?

Frontier LLM training runs are limited by their token budget (the number of tokens the compute can pay for), not by the amount of data available. The Chinchilla scaling laws and their successors say a fixed compute budget is best spent on more unique tokens rather than re-reading the same tokens. With trillions of high-quality tokens available, training corpora are large enough that the compute budget runs out before one full pass completes.

How many epochs should fine-tuning run?

1-3 epochs is a strong default for SFT on tens of thousands of examples; longer often overfits visibly. For very small SFT datasets (hundreds of examples), 3-10 epochs is reasonable but the validation loss should be watched closely. LoRA tolerates more epochs than full fine-tuning because there are fewer parameters to overfit.

What's the difference between an epoch and a step?

A step is a single gradient update: one forward pass, one backward pass, one optimizer step on one batch. An epoch is one walk through the whole training set, which takes dataset_size / batch_size steps. Modern training runs are usually counted in steps or tokens, not epochs.
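The relationship shows up directly in the shape of a training loop: epochs are the outer loop, steps the inner one. The skeleton below is a sketch with no real model; the helper names are hypothetical, the dataset is a plain list, and the forward pass, backward pass, and optimizer update are left as a comment. It also skips shuffling, which a real loop would normally do at the start of each epoch.

```python
def batches(dataset, batch_size):
    """Yield consecutive slices of the dataset, one batch at a time."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]

def run_epochs(dataset, batch_size, num_epochs):
    """Return a list of (epoch, global_step) recorded at the end of each epoch."""
    global_step = 0
    log = []
    for epoch in range(1, num_epochs + 1):          # one epoch = one full pass
        for batch in batches(dataset, batch_size):  # one step = one batch
            # A real loop would run forward, backward, and the optimizer update here.
            global_step += 1
        log.append((epoch, global_step))
    return log

log = run_epochs(list(range(32)), batch_size=4, num_epochs=4)
print(log)  # [(1, 8), (2, 16), (3, 24), (4, 32)]
```

The step counter keeps increasing across epochs, which is why logs and learning-rate schedules are usually keyed on the global step rather than the epoch index.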
