Grokking

Also known as: grok, delayed generalization

TL;DR

Grokking is the training-dynamics phenomenon where a model first memorizes the training set, then much later — often suddenly — learns to generalize to held-out data.

Grokking is the training-dynamics phenomenon where a neural network first memorizes its training set (perfect training accuracy, near-zero training loss) and then, much later in training, suddenly learns to generalize to held-out data. Validation accuracy can stay at chance for thousands of optimizer steps after training accuracy is perfect, then jump abruptly to 100%. The phenomenon was named by Power et al. (2022) at OpenAI and is one of the cleanest empirical demonstrations that training can pass through qualitatively distinct phases rather than improving everything at once.

The original observation

The 2022 paper trained small transformers on modular arithmetic: given a and b, output (a + b) mod p. With only a small fraction of all (a, b) pairs used as training data, the standard story would say: either the model has enough capacity to memorize and won’t generalize, or it doesn’t and will fail outright.
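To make the setup concrete, here is a minimal sketch of how such a dataset could be built. The modulus p = 97, the 30% training fraction, and the function name are illustrative assumptions, not values taken from this article.

```python
import random

def make_modular_addition_split(p=97, train_fraction=0.3, seed=0):
    """Enumerate every (a, b) pair, label it with (a + b) mod p, and split at random.

    p, train_fraction, and seed are illustrative assumptions.
    """
    pairs = [(a, b) for a in range(p) for b in range(p)]
    rng = random.Random(seed)          # local RNG so the split is reproducible
    rng.shuffle(pairs)
    n_train = int(train_fraction * len(pairs))
    label = lambda ab: (ab, (ab[0] + ab[1]) % p)
    train = [label(ab) for ab in pairs[:n_train]]
    val = [label(ab) for ab in pairs[n_train:]]   # held-out pairs never seen in training
    return train, val

train, val = make_modular_addition_split()
print(len(train), len(val))
```

Because there are p possible outputs, chance-level validation accuracy on this task is roughly 1/p, which is the baseline the validation curve sits at during the long flat region.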

What actually happened:

Phase 1 (steps 0 - ~10³). Training accuracy climbs to 100%; validation accuracy stays near chance. This is classical overfitting — memorization.
Phase 2 (steps ~10³ - ~10⁵). Training accuracy stays at 100%; training loss inches lower; validation accuracy stays at chance. The model looks “stuck.”
Phase 3 (around step ~10⁵). Validation accuracy snaps from chance to 100% in a small number of steps. The model has discovered the rule.

The transition is fast relative to the long flat region preceding it. Training the model 100× longer than it took to memorize was necessary to see generalization emerge.
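One way to quantify the gap between memorization and generalization from logged accuracy curves is sketched below. The 0.99 threshold, the function names, and the synthetic log are assumptions for illustration; the numbers are shaped like the three phases above and are not real measurements.

```python
def first_step_at_or_above(steps, accuracies, threshold):
    """Return the first logged step whose accuracy reaches the threshold, or None."""
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None

def grokking_delay(steps, train_acc, val_acc, threshold=0.99):
    """Return (memorized_at, generalized_at, ratio), or None if either never happens."""
    memorized_at = first_step_at_or_above(steps, train_acc, threshold)
    generalized_at = first_step_at_or_above(steps, val_acc, threshold)
    if memorized_at is None or generalized_at is None:
        return None
    return memorized_at, generalized_at, generalized_at / max(memorized_at, 1)

# Synthetic log (illustrative only).
steps     = [100, 1_000, 10_000, 50_000, 100_000, 110_000]
train_acc = [0.40, 1.00, 1.00, 1.00, 1.00, 1.00]
val_acc   = [0.01, 0.01, 0.01, 0.02, 0.60, 1.00]

result = grokking_delay(steps, train_acc, val_acc)
print(result)
```

A large ratio with a flat validation curve in between is the signature described above; a ratio near 1 would indicate ordinary learning where both curves rise together.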

What’s happening internally

Mechanistic-interpretability work on grokking (Nanda et al., among others) suggests that two solutions are competing during training:

A memorization solution that fits the training set by storing each input-output pair. Reachable fast by SGD; high norm.
A generalization solution that implements the actual algorithm (in the modular-arithmetic case, a discrete Fourier transform structure that computes (a + b) mod p for any input). Slower for SGD to find; low norm.
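The Fourier-style solution can be illustrated by hand. The sketch below is a hand-written version of the idea, not weights extracted from a trained model: it represents a and b by sines and cosines at a few frequencies, combines them with angle-addition identities, and scores each candidate answer c by the sum of cos(2πk(a + b - c)/p). The modulus 13 and the frequency set (1, 2, 5) are arbitrary assumptions.

```python
import math

def fourier_logits(a, b, p, frequencies):
    """Score every candidate c by sum_k cos(2*pi*k*(a + b - c) / p)."""
    logits = []
    for c in range(p):
        total = 0.0
        for k in frequencies:
            w = 2 * math.pi * k / p
            # cos and sin of w*(a + b), built only from per-input features
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # cos(w*(a + b - c)) via the angle-difference identity
            total += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        logits.append(total)
    return logits

def predict(a, b, p=13, frequencies=(1, 2, 5)):
    """Pick the candidate with the highest score."""
    logits = fourier_logits(a, b, p, frequencies)
    return max(range(p), key=logits.__getitem__)

print(predict(9, 7), (9 + 7) % 13)
```

Every cosine term equals 1 exactly when c is congruent to a + b modulo p and is strictly smaller otherwise, so the highest score always lands on the correct answer, including for pairs that were never in any training set. That is what makes this solution general rather than a lookup table.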

The memorization solution wins early because it’s locally easier to reach. Weight decay slowly penalizes the high-norm memorization solution, pushing weights toward the lower-norm generalization solution. Once the optimizer crosses into a basin that supports the algorithmic solution, generalization emerges sharply.

This story explains why grokking depends so strongly on weight decay (without it, the memorization solution can remain stable for far longer, possibly indefinitely) and why the transition is sudden (the algorithmic solution is structurally different, not a smooth interpolation).
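A rough sense of why the weight-decay pressure acts slowly can be had from a simplified calculation. The sketch assumes decoupled weight decay (each step multiplies the weights by 1 - lr * weight_decay) and, unrealistically, a zero loss gradient; in real training the loss gradient pushes back, so this is only a lower bound on the timescale. The learning rate and decay values are illustrative assumptions.

```python
import math

def norm_after_decay(initial_norm, lr, weight_decay, steps):
    """Weight norm after `steps` decay-only updates (loss gradient assumed zero)."""
    return initial_norm * (1.0 - lr * weight_decay) ** steps

def steps_to_halve(lr, weight_decay):
    """Smallest number of decay-only steps that brings the norm to half or less."""
    return math.ceil(math.log(0.5) / math.log(1.0 - lr * weight_decay))

# Illustrative hyperparameters, not taken from this article.
lr, weight_decay = 1e-3, 1.0
n = steps_to_halve(lr, weight_decay)
print(n, norm_after_decay(1.0, lr, weight_decay, n))
```

Even in this idealized case, halving the norm takes hundreds of steps at these settings, and smaller decay coefficients stretch that out proportionally, which is consistent with the long plateau before generalization appears.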

What grokking suggests about training

Memorization and generalization can be temporally distinct phases
The training loss alone is not a sufficient signal of model quality
Long training past apparent convergence can produce qualitative jumps
Regularization (explicit weight decay plus the implicit bias of SGD) is doing structural work over very long timescales
Validation accuracy can be discontinuous in training time

Grokking matters for two reasons. First, it is one of the cleanest pieces of evidence that overparameterized neural networks undergo internal phase transitions during training: they’re not just slowly improving, they’re occasionally restructuring. This puts strong pressure on the simple intuition that “loss going down = capability going up.” Second, related phenomena show up at scale: emergent capabilities in large language models (where a benchmark sits near chance until some scale, then jumps), reasoning skills that appear suddenly during fine-tuning, and the general observation that training past apparent convergence sometimes still pays off. Whether scaled-up grokking is “really” grokking is debated, but the conceptual lesson, that training can do qualitatively distinct things sequentially, has held up.

Grokking is delayed, sudden generalization — the model memorizes first, then snaps into a generalizing solution thousands of steps later. It’s the cleanest empirical evidence that training has phases, and that loss-curve intuitions can mislead you about what the model has actually learned.

Go further

Where did the term come from?

Power, Burda, Edwards, Babuschkin, and Misra at OpenAI introduced grokking in a 2022 paper studying small transformers learning modular arithmetic. They showed that on tasks like 'compute (a + b) mod p,' the model would memorize training examples in a few thousand steps, then sit at near-random validation accuracy for tens of thousands more steps, then abruptly hit 100% validation accuracy.

Is grokking just slow learning?

No, and that's the surprising part. The transition is sharp, not gradual. Validation accuracy can stay at chance for 10,000+ steps after training accuracy is perfect, then jump to 100% in a few hundred steps. The training loss is already near zero and barely moves, so some other internal change must be driving the generalization.

Does grokking matter for real LLMs?

It's debated. The original setup is a tiny transformer on a synthetic algebraic task, which is far from production training. But the broader insight — that memorization and generalization can be temporally separated, and that long training past 'apparent convergence' can produce qualitative jumps — does seem to show up in scaled training, particularly on reasoning benchmarks where capabilities emerge non-monotonically.
