Also known as: grad clip, gradient norm clipping, global norm clipping
TL;DR
Gradient clipping caps the norm of the gradient before applying the optimizer step, preventing rare but catastrophic large gradients from blowing up training. The modern default is global-norm clipping at threshold 1.0.
It computes the global 2-norm ||g|| of the concatenated gradient across every parameter and, if that norm exceeds the threshold c, rescales the entire gradient by c / ||g||. The direction is preserved; only the magnitude is capped. The optimizer then applies the clipped gradient as usual.
This is a defense against exploding gradients — rare events where the backpropagation pass returns a gradient many times larger than typical, often due to numerical issues, a bad data point, or transient instability in the loss landscape. Without clipping, a single such spike can move the model so far it never recovers.
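The rescaling rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's implementation: the function name `clip_by_global_norm`, the list-of-lists gradient layout, and the example numbers are all assumptions made for the sketch.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale every gradient by one shared factor so the global 2-norm is at most max_norm.

    grads: list of per-parameter gradients, each a flat list of floats (assumed layout).
    Returns (clipped_grads, pre_clip_norm).
    """
    # Global 2-norm: treat all parameters as one concatenated vector.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Only ever shrink; a small gradient is never scaled up to the threshold.
    scale = min(1.0, max_norm / total_norm) if total_norm > 0.0 else 1.0
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

# Two hypothetical parameter tensors; global norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, pre_clip_norm = clip_by_global_norm(grads, max_norm=1.0)
post_clip_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(pre_clip_norm, post_clip_norm)
```

Every component is multiplied by the same factor, so the ratios between components (and between parameters) are unchanged. Returning the pre-clip norm alongside the clipped gradient is a convenient choice for logging.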
The exploding-gradient problem
Deep networks compose many operations in series. The chain rule means gradients are products of local Jacobians at every layer. If those Jacobians have norm consistently above 1, the gradient grows exponentially with depth — the textbook “exploding gradients” failure mode. Below 1, gradients vanish.
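A toy calculation makes the exponential dependence on depth concrete. It assumes, purely for illustration, that every layer scales the backpropagated gradient by the same scalar gain; real Jacobians are matrices and vary by layer, and the helper name `backprop_gain` is hypothetical.

```python
def backprop_gain(per_layer_gain, depth):
    """Overall gradient scaling after passing back through `depth` layers,
    assuming each layer multiplies the gradient magnitude by the same factor."""
    return per_layer_gain ** depth

for gain in (0.9, 1.0, 1.1):
    scalings = [backprop_gain(gain, depth) for depth in (10, 50, 100)]
    print(gain, scalings)
```

With a gain of 1.1 the gradient is more than ten thousand times larger after 100 layers, while a gain of 0.9 shrinks it by a comparable factor. A 10% deviation per layer is enough to produce either failure mode.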
Modern architectures (residual connections, LayerNorm, careful initialization) make exploding gradients much rarer than in pre-2017 RNNs. But “rarer” is not “never.” Across a pretraining run of hundreds of thousands of steps, the rare bad gradient (from a corrupt training example, a numerical edge case, or a bad neighborhood in the loss landscape) is all but certain to appear at some point. Without a guard, one such step can permanently corrupt the model and force a rollback to a checkpoint hours earlier.
Clipping is the cheap insurance: it costs almost nothing per step and bounds the worst-case damage from any single update.
Why threshold 1.0
The threshold c = 1.0 has emerged as the de facto default across published transformer training recipes, GPT-3 and Llama among them, and in most public open-source stacks. The exact value is not load-bearing; the clip rarely triggers in well-tuned, stable training. What matters is that it exists. A run with c = 1.0 and a run with c = 5.0 produce nearly identical loss curves until one of them sees a spike, at which point the lower-threshold run survives and the higher-threshold run might not.
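The claim that two thresholds behave identically until a spike arrives can be checked with a small sketch. The pre-clip norms below are invented for illustration, and `clipped_norm` is a hypothetical helper, not a library function.

```python
def clipped_norm(grad_norm, threshold):
    """Norm of the gradient after global-norm clipping at `threshold`."""
    scale = min(1.0, threshold / grad_norm) if grad_norm > 0.0 else 1.0
    return grad_norm * scale

# Hypothetical pre-clip norms: four ordinary steps and one spike.
pre_clip = [0.31, 0.28, 0.35, 42.0, 0.30]
post = {c: [clipped_norm(n, c) for n in pre_clip] for c in (1.0, 5.0)}
for c, norms in post.items():
    print(f"c={c}: {[round(n, 2) for n in norms]}")
```

On the ordinary steps neither threshold changes anything, so the two runs take the same updates. On the spike, the c = 5.0 run takes a step five times larger than the c = 1.0 run, which is where their fates can diverge.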
Value clipping vs. norm clipping
Value clipping (also called “element-wise clipping”) clamps each component of the gradient independently:
g_i ← max(min(g_i, c), -c) for every i
If only some components exceed the threshold, those get clamped while others don’t. The result is a gradient that points in a different direction from the original. Geometrically, you’ve projected onto the box [-c, c]^n, which is not a direction-preserving operation.
Norm clipping rescales the whole vector by a single factor:
if ||g|| > c: g ← g · (c / ||g||)
The result is a gradient pointing in exactly the same direction as the original, just shorter. This matters because the gradient direction is the part that carries the descent information; the magnitude is just step size. Distorting direction is the kind of subtle damage that hurts convergence in ways that are hard to debug.
Value clipping was used in some early RNN recipes but is essentially extinct in modern training.
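The difference between the two rules shows up directly in the cosine similarity between the original and clipped gradients. The following is a minimal sketch with a made-up three-component gradient; the function names are illustrative only.

```python
import math

def clip_by_value(g, c):
    """Clamp each component independently to [-c, c]."""
    return [max(min(x, c), -c) for x in g]

def clip_by_norm(g, c):
    """Rescale the whole vector so its 2-norm is at most c."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, c / norm) if norm > 0.0 else 1.0
    return [x * scale for x in g]

def cosine(a, b):
    """Cosine similarity; 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

g = [10.0, 0.5, -0.2]  # hypothetical gradient with one dominant component
print("value clipping:", cosine(g, clip_by_value(g, 1.0)))
print("norm clipping: ", cosine(g, clip_by_norm(g, 1.0)))
```

Value clipping clamps only the first component, so the result tilts toward the smaller components and the cosine falls noticeably below 1. Norm clipping leaves the cosine at 1 up to floating-point rounding.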
When clipping triggers a lot
If your training run is hitting the clip threshold frequently — say, more than 5-10% of steps — that’s a signal, not a solution. Frequent clipping means your raw gradients are routinely large, which usually points at:
Learning rate too high: the most common cause. Halve it and watch.
Insufficient warmup: in early steps the optimizer's running statistics (for example, Adam's moment estimates) are still poorly calibrated, so updates are noisy. Lengthen warmup.
Bad data: pathological training examples (all-zero, extremely long, encoded incorrectly) can produce large losses and large gradients. Inspect the inputs that trigger the largest pre-clip norms.
Numerical instability: float16 overflow in attention or LayerNorm can cause spikes. Training in bfloat16, or in float16 with dynamic loss scaling, typically resolves this.
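Diagnosing any of the causes above starts with logging the pre-clip norm at every step. The sketch below assumes such a log already exists as a plain list of floats; the helper name `clip_stats` and the sample values are hypothetical.

```python
def clip_stats(pre_clip_norms, threshold, top_k=3):
    """Return (fraction of steps that were clipped, top_k (step, norm) pairs by norm)."""
    clipped_steps = sum(1 for n in pre_clip_norms if n > threshold)
    rate = clipped_steps / len(pre_clip_norms)
    # Largest norms first; the step indices identify which batches to inspect.
    worst = sorted(enumerate(pre_clip_norms), key=lambda p: p[1], reverse=True)[:top_k]
    return rate, worst

# Hypothetical log of pre-clip global norms for ten steps.
norms = [0.4, 0.5, 3.2, 0.45, 0.6, 17.0, 0.5, 0.48, 0.52, 0.47]
rate, worst = clip_stats(norms, threshold=1.0)
print(f"clip rate: {rate:.0%}")
print("steps to inspect:", worst)
```

A rate well above the 5-10% range mentioned earlier points toward the learning rate or warmup. A low rate with a few extreme outliers points toward specific bad batches, which the step indices help locate.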
Go further
Why is the threshold almost always 1.0?
Empirical convergence across labs: most published transformer recipes (GPT-3 and Llama among them) clip the global gradient norm at 1.0. The exact value isn't load-bearing, because the clip rarely triggers under stable training conditions. It exists to catch the rare 1-in-10000-step spike that would otherwise corrupt the model.
What's the difference between value clipping and norm clipping?
Value clipping caps each gradient component independently at some bound. When it triggers on only some components, it changes the direction of the gradient, which discards part of the descent information. Norm clipping rescales the whole gradient vector to have norm at most c, preserving direction. Norm clipping is the default; value clipping is essentially deprecated in modern recipes.
Should you clip per-parameter, per-layer, or globally?
Globally: compute one norm across the concatenated gradient of all parameters and scale them all by the same factor. Per-parameter or per-layer clipping distorts the relative magnitudes the optimizer expects. The standard utilities (PyTorch's clip_grad_norm_, Optax's clip_by_global_norm) implement global-norm clipping for this reason.
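The distortion from per-layer clipping can be seen in a two-layer toy example. This is a sketch under assumed values: the layer gradients are invented, and `clip_global` and `clip_per_layer` are illustrative helpers rather than framework functions.

```python
import math

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def clip_global(layers, c):
    """One shared scale factor computed from the norm of all layers together."""
    total = l2([x for layer in layers for x in layer])
    scale = min(1.0, c / total) if total > 0.0 else 1.0
    return [[x * scale for x in layer] for layer in layers]

def clip_per_layer(layers, c):
    """A separate scale factor per layer, each from that layer's own norm."""
    out = []
    for layer in layers:
        norm = l2(layer)
        scale = min(1.0, c / norm) if norm > 0.0 else 1.0
        out.append([x * scale for x in layer])
    return out

# Hypothetical gradients: layer norms 50 and 0.5, a 100:1 ratio.
layers = [[30.0, 40.0], [0.3, 0.4]]
g = clip_global(layers, 1.0)
p = clip_per_layer(layers, 1.0)
ratio_global = l2(g[0]) / l2(g[1])
ratio_per_layer = l2(p[0]) / l2(p[1])
print(ratio_global, ratio_per_layer)
```

Global clipping keeps the ratio between the two layers at roughly 100:1. Per-layer clipping shrinks only the large layer, so the ratio collapses to roughly 2:1 and the combined update points in a different direction.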