Dropout

Also known as: dropout regularization, Hinton dropout

TL;DR

Dropout randomly zeroes a fraction of activations during training, forcing the network to spread its representations across many redundant paths instead of co-adapting onto a few. It is switched off at inference.

Dropout is a regularization technique that randomly zeroes each activation in a layer with probability p during training. Surviving activations are scaled by 1/(1 - p) so the expected magnitude is preserved. At inference, dropout is off and the full network runs deterministically. It is one line in modern frameworks and was, for nearly a decade, the default fix for overfitting in deep networks.
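A minimal sketch of the definition above, written with only the Python standard library so the mechanics are visible. The function name `dropout` and its `training` and `rng` parameters are illustrative choices, not any framework's API.

```python
import random


def dropout(x, p, training, rng=random):
    """Inverted dropout on a list of floats (illustrative helper)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(x)  # inference: pass-through, no randomness
    keep = 1.0 - p
    # Each activation survives with probability `keep` and is scaled by 1/keep.
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


rng = random.Random(0)          # seeded only to make the demo repeatable
x = [1.0] * 10_000

train_out = dropout(x, p=0.1, training=True, rng=rng)
eval_out = dropout(x, p=0.1, training=False)

train_mean = sum(train_out) / len(train_out)
print("training mean:", round(train_mean, 3))   # close to 1.0 on average
print("inference unchanged:", eval_out == x)
```

In training mode roughly one activation in ten is zeroed and the rest are scaled up, so the mean stays near the input mean. In inference mode the input is returned untouched.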

What it actually prevents

The original 2014 paper by Srivastava et al. (with Hinton as a co-author) framed dropout as preventing co-adaptation: features that learn to rely on the presence of other specific features. Co-adapted features memorize the training set well but generalize poorly because they cannot function alone.

By randomly killing units each step, dropout forces every feature to be useful on its own — without being able to count on its neighbors. The network learns redundant, distributed representations: any single activation can disappear without breaking the prediction.

A second, equivalent framing: dropout is implicit ensembling. Each minibatch trains a different random sub-network sampled from 2^n possible sub-networks (where n is the number of activations). At inference the full network approximates an average over all of them. Ensembles reduce variance; dropout buys ensembling without paying for separate training runs.
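The ensembling view can be checked by brute force on a toy case. The sketch below assumes a single linear unit with three inputs (the weights and inputs are made-up numbers) and enumerates all 2^3 = 8 masks. For a linear unit the probability-weighted average over sub-networks matches the full network exactly; for deep nonlinear networks the match is only approximate.

```python
from itertools import product


def subnetwork_output(weights, x, mask, p):
    """Output of one linear unit under a fixed mask, with inverted scaling."""
    keep = 1.0 - p
    return sum(w * m * xi / keep for w, m, xi in zip(weights, mask, x))


def mask_probability(mask, p):
    """Probability of drawing this exact mask with independent drops."""
    prob = 1.0
    for m in mask:
        prob *= (1.0 - p) if m else p
    return prob


weights = [0.5, -1.0, 2.0]   # hypothetical weights
x = [1.0, 2.0, 3.0]          # hypothetical input activations
p = 0.5

masks = list(product([0, 1], repeat=len(x)))   # all 2^n sub-networks
ensemble_avg = sum(
    mask_probability(m, p) * subnetwork_output(weights, x, m, p) for m in masks
)
full_output = sum(w * xi for w, xi in zip(weights, x))

print(len(masks), "sub-networks")
print("ensemble average:", ensemble_avg)
print("full network:    ", full_output)
```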

Why `p = 0.1` for transformers

Vintage dropout used p = 0.5 on small fully-connected nets where overfitting was the dominant failure mode. Transformers are different: they are deep, residual, layer-normalized, and operate on a high-dimensional residual stream. High dropout damages the residual signal and slows training.

The empirically tuned default that emerged with BERT and GPT-2 is p = 0.1 on attention weights and on the feed-forward output. At very large scale (frontier-LLM pretraining on trillions of tokens), labs often use p = 0 because:

The dataset is so large that overfitting is not the bottleneck.
Each token is seen roughly once anyway, so the implicit ensemble effect is wasted.
Dropout slows convergence on large compute budgets.

Dropout often comes back during fine-tuning, where the dataset is small and overfitting is again the dominant risk.
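To make "dropout on attention weights" concrete, here is a small standard-library sketch of one query attending over three scalar values. Dropout is applied to the softmax output before the weighted sum. The helper names (`softmax`, `inverted_dropout`, `attend`) and the numbers are illustrative assumptions; a real transformer block would use tensors and would apply the same kind of dropout to the feed-forward output as well.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inverted_dropout(values, p, training, rng):
    """Zero each entry with probability p and scale survivors by 1/(1 - p)."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def attend(scores, values, p, training, rng):
    """One query over scalar values, with dropout on the attention weights."""
    weights = softmax(scores)
    weights = inverted_dropout(weights, p, training, rng)  # after the softmax
    return sum(w * v for w, v in zip(weights, values))


rng = random.Random(0)
scores = [2.0, 1.0, 0.1]        # hypothetical query-key scores
values = [10.0, 20.0, 30.0]     # hypothetical value entries

eval_out = attend(scores, values, p=0.1, training=False, rng=rng)
train_out = attend(scores, values, p=0.1, training=True, rng=rng)
print("inference output:", eval_out)
print("one training-time sample:", train_out)
```

Note that after dropout the attention weights no longer sum to exactly 1 on any single training step. They do so only in expectation.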

Where dropout fits in the regularization stack

Regularizers in modern training

Weight decay — penalize parameter magnitudes; the dominant regularizer in modern LLM pretraining.
Dropout — stochastic activation masking; load-bearing for fine-tuning, often disabled at frontier-pretraining scale.
Early stopping — stop when validation loss climbs; cheap and effective on small data.
Data augmentation — add invariances by perturbing inputs; the dominant regularizer in vision.
Label smoothing — replace one-hot targets with a slightly smoothed distribution.

These are not mutually exclusive — production training recipes typically use weight decay plus a small dropout plus learning-rate decay, all at once.

Without rescaling, dropout would shrink the expected output of a layer by a factor of (1 - p) whenever it is active. At inference, with dropout off, activations would be larger than during training by a factor of 1/(1 - p): a distribution shift between training and inference that breaks calibrated downstream layers.

The fix, called inverted dropout, is to scale survivors up during training:

y = (mask * x) / (1 - p)

Now E[y] = x regardless of p. The training-time and inference-time activations have the same expected magnitude, so the rest of the network sees a consistent input distribution. Modern frameworks (PyTorch’s nn.Dropout, TensorFlow’s tf.nn.dropout) implement inverted dropout by default, which is why you do not see any explicit rescaling at inference.
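With the inverted-dropout formula above, the expected value of y equals x. A short derivation, assuming each mask entry is an independent Bernoulli draw equal to 1 with probability 1 - p (the symbol m for a mask entry is notation introduced here):

```latex
\mathbb{E}[y]
  = \frac{\mathbb{E}[m]\, x}{1 - p}
  = \frac{(1 - p)\, x}{1 - p}
  = x,
\qquad\text{whereas without rescaling}\qquad
\mathbb{E}[m\, x] = (1 - p)\, x .
```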

A subtle consequence: gradients flowing through surviving units during training are also scaled by 1/(1 - p). This means dropout interacts with optimizer hyperparameters: a higher dropout rate effectively raises the gradient scale, which can require lower learning rates to stay stable.
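The size of this effect can be worked out exactly for a single activation. The sketch below assumes y = mask * x / (1 - p), so the local gradient dy/dx is mask / (1 - p). The helper name `grad_factor_moments` is illustrative. The average gradient factor stays at 1, but each surviving unit sees a factor of 1/(1 - p), and the mean square grows as 1/(1 - p).

```python
def grad_factor_moments(p):
    """Survivor factor, mean, and mean square of dy/dx = mask / (1 - p)."""
    keep = 1.0 - p
    survivor = 1.0 / keep                      # factor when the unit is kept
    mean = keep * survivor                     # dropped units contribute 0
    mean_square = keep * survivor ** 2         # dropped units contribute 0
    return survivor, mean, mean_square


for p in (0.0, 0.1, 0.5):
    survivor, mean, mean_square = grad_factor_moments(p)
    print(
        f"p={p}: survivor factor={survivor:.3f}, "
        f"mean={mean:.3f}, mean square={mean_square:.3f}"
    )
```

At p = 0.1 the survivor factor is about 1.11, a mild change. At p = 0.5 it is 2, which is one way to see why high dropout rates can call for a smaller learning rate.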

At frontier scale, dropout is mostly a relic of the pre-transformer era, but it remains the right tool whenever you are fine-tuning a model on a small dataset, training a head on a frozen backbone, or working in any other regime where overfitting is the actual enemy. The line nn.Dropout(0.1) carries an entire generation of hard-won regularization wisdom.

Go further

Why is `p = 0.1` the modern default for transformers?

Original 2014 dropout used p = 0.5 on small fully-connected nets. Transformers are overparameterized differently — the residual stream is fragile and high dropout breaks training. Empirically p = 0.1 (or p = 0 at very large scale) gives the best test loss; modern frontier LLMs often use 0 dropout during pretraining and add it only for fine-tuning.

Why is dropout disabled at inference?

Dropout is a training-time stochastic operation. At inference you want a deterministic prediction, so the dropout layer becomes a pass-through. The implementation rescales activations during training (inverted dropout) so the expected magnitude stays constant whether dropout is on or off.

What is dropout doing in information-theoretic terms?

Each training-time forward pass samples a different sub-network, and at inference the full network's prediction approximates an average over them: an implicit ensemble of 2^n sub-networks. This ensembling effect is what reduces variance and makes predictions more robust to single-feature failure.
