Epoch

Also known as: training epoch, pass over the data

TL;DR

An epoch is one full pass over the training set. Classic deep learning trains for tens to hundreds of epochs; modern LLM pretraining usually runs for less than one epoch over the full corpus, so most tokens are seen at most once. Fine-tuning typically runs 1-10 epochs.

An epoch is a single full pass over the training dataset: the model sees every training example exactly once. If your dataset has N examples and you train with batch size B, one epoch is N/B gradient-descent steps (rounded up when the last batch is partial). Training for E epochs means showing the model the data E times.
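The step count per epoch can be sketched in a few lines. This is a minimal illustration, not any framework's API: the function names and the `drop_last` flag (whether a partial final batch is skipped) are assumptions chosen for the example.

```python
def steps_per_epoch(num_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of gradient steps in one full pass over the data."""
    if num_examples <= 0 or batch_size <= 0:
        raise ValueError("num_examples and batch_size must be positive")
    if drop_last:
        # A partial final batch is skipped entirely.
        return num_examples // batch_size
    # A partial final batch still counts as one step (ceiling division).
    return (num_examples + batch_size - 1) // batch_size


def total_steps(num_examples: int, batch_size: int, num_epochs: int,
                drop_last: bool = False) -> int:
    """Gradient steps for a run of num_epochs full passes."""
    return num_epochs * steps_per_epoch(num_examples, batch_size, drop_last)


if __name__ == "__main__":
    # Hypothetical run: 50,000 examples, batch size 128, 90 epochs.
    print(steps_per_epoch(50_000, 128))                  # 391
    print(steps_per_epoch(50_000, 128, drop_last=True))  # 390
    print(total_steps(50_000, 128, 90))                  # 35190
```

Whether the last partial batch counts as a step depends on how the data loader is configured, which is why the two variants differ by one step per epoch here.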

Where the term came from

Epochs were the natural unit of training in pre-2015 deep learning, when datasets were small (CIFAR-10: 60K images, ImageNet: 1.2M images) and models were trained from scratch for many passes. A typical ImageNet ResNet recipe ran 90-300 epochs. Modern AI broke the convention: LLM corpora are large enough that the model never re-sees most tokens. The unit drifted from epochs to tokens for pretraining and steps for everything else. When a 2026 paper says “trained on 15 trillion tokens,” that is the training description — epoch count isn’t mentioned.

The modern landscape

Regime	Typical epochs	Why
LLM pretraining	under 1 (often 0.3-0.7)	Compute-limited; corpora too large to re-read
LLM continued pretraining	1-2	Domain adaptation on smaller corpora
Supervised fine-tuning (SFT)	1-3	Small datasets, visible overfitting beyond ~3
Preference optimization (DPO, RLHF)	1-2	Tiny datasets, very prone to overfit
LoRA fine-tuning	3-10	Fewer trainable params buys overfitting headroom
Embedding model training	1-10	Depends heavily on contrastive recipe
Image classification	50-300	Small datasets relative to model capacity
ImageNet from scratch	90-300	Convention; scales with regularization recipe

The trend over the last decade is fewer epochs as datasets grow. Chinchilla showed compute is better spent on more unique tokens than re-reading the same ones, which set the LLM pretraining default. In fine-tuning, repeated exposure to a small SFT dataset is the most common path to overfitting.

Repeating data during pretraining is sometimes worth it. Once you’ve exhausted the high-quality unique tokens and still have compute budget, the question becomes: train a smaller model on more unique tokens, or train a larger model that re-sees tokens?

Empirical work in the post-Chinchilla era shows that repeating data up to ~4 epochs is roughly as effective as adding fresh tokens of comparable quality. Beyond that, repetitions add diminishing value and eventually start to hurt — the model overfits to specific tokens and loses generalization.

So in practice: high-quality tokens are repeated up to a few epochs in modern LLM training, while lower-quality tokens are seen exactly once or filtered out. The aggregate is “less than one epoch over the union, but multiple epochs over the curated subset.”
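The "less than one epoch over the union, multiple epochs over the curated subset" arithmetic can be checked with a small sketch. The source names, token counts, and per-source epoch counts below are hypothetical numbers picked for illustration, not figures from any real training run.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    unique_tokens: float  # distinct tokens available in this source
    epochs: float         # how many passes the run makes over this source


def aggregate_epochs(sources: list[Source]) -> tuple[float, float]:
    """Return (tokens_trained, epochs_over_union) for a data mixture."""
    tokens_trained = sum(s.unique_tokens * s.epochs for s in sources)
    unique_total = sum(s.unique_tokens for s in sources)
    return tokens_trained, tokens_trained / unique_total


if __name__ == "__main__":
    # Hypothetical mixture: a small curated set repeated 3 times,
    # and a large web crawl of which only half is ever read.
    mixture = [
        Source("curated", unique_tokens=1e12, epochs=3.0),
        Source("web", unique_tokens=10e12, epochs=0.5),
    ]
    trained, union_epochs = aggregate_epochs(mixture)
    print(f"tokens trained: {trained:.2e}")            # 8.00e+12
    print(f"epochs over the union: {union_epochs:.2f}")  # 0.73
```

In this made-up mixture the curated subset is read three times, yet the run as a whole covers about 0.73 epochs of the combined corpus.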

Watching the validation loss

The classical use of an epoch was as the unit on which you check validation loss to detect overfitting: train E epochs, evaluate after each, stop when it starts climbing. Early stopping at the best validation epoch was the default regularization tool. For modern LLMs the equivalent is evaluation steps — checkpoint every K thousand steps, calibrated so you get ~10-50 evaluation points across the run.
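The classical per-epoch early-stopping rule can be sketched as follows. This is an illustrative version under stated assumptions: the `patience` and `min_delta` parameters are a common way to make "stop when it starts climbing" robust to noise, not something the text above prescribes, and the loss values are invented.

```python
def early_stop(val_losses: list[float], patience: int = 2,
               min_delta: float = 0.0) -> tuple[int, int]:
    """Scan per-epoch validation losses and apply early stopping.

    Returns (stop_epoch, best_epoch), both 1-indexed. Training stops once
    `patience` consecutive epochs have passed without the loss improving
    on the best value by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never triggered: the run used every epoch it was given.
    return len(val_losses), best_epoch


if __name__ == "__main__":
    # Hypothetical validation curve: improves for 4 epochs, then climbs.
    losses = [0.90, 0.70, 0.62, 0.60, 0.63, 0.66, 0.71]
    print(early_stop(losses, patience=2))  # (6, 4)
```

The checkpoint to keep is the one from `best_epoch`, not the one from the epoch at which training stopped. The same logic carries over to step-based evaluation by treating each evaluation point as one entry in the list.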

Go further

Why is LLM pretraining usually less than one epoch?

Frontier-LLM training runs are compute-limited, not data-limited. The Chinchilla scaling laws and their successors say a fixed compute budget is best spent on more unique tokens rather than re-reading the same tokens. With trillions of high-quality tokens available, training corpora are large enough that the compute budget runs out before one full pass over the corpus completes.


How many epochs should fine-tuning run?

1-3 epochs is a strong default for SFT on tens of thousands of examples; longer often overfits visibly. For very small SFT datasets (hundreds of examples), 3-10 epochs is reasonable but the validation loss should be watched closely. LoRA tolerates more epochs than full fine-tuning because there are fewer parameters to overfit.


What's the difference between an epoch and a step?

A step is a single gradient update: one forward pass, one backward pass, one optimizer step on one batch. An epoch is one pass through the training set, which takes dataset_size / batch_size steps. Modern training runs are usually counted in steps or tokens, not epochs.
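Going the other way, a step counter can be converted back into epochs and tokens. The sketch below assumes every batch holds `batch_size` sequences of a fixed `seq_len` tokens (as with packed pretraining data); the function name and the example numbers are hypothetical.

```python
from typing import Optional


def progress(step: int, batch_size: int, dataset_size: int,
             seq_len: Optional[int] = None) -> tuple[float, Optional[int]]:
    """Convert a step count into (epochs_completed, tokens_seen).

    tokens_seen is None when no fixed sequence length is given.
    """
    examples_seen = step * batch_size
    epochs = examples_seen / dataset_size
    tokens = examples_seen * seq_len if seq_len is not None else None
    return epochs, tokens


if __name__ == "__main__":
    # Hypothetical run: 1,000 steps at batch size 512 over 1,024,000 sequences
    # of 2,048 tokens each.
    epochs, tokens = progress(step=1_000, batch_size=512,
                              dataset_size=1_024_000, seq_len=2_048)
    print(epochs)  # 0.5
    print(tokens)  # 1048576000
```

With variable-length examples the token count has to be accumulated from the actual batches rather than computed from a fixed `seq_len`.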

