Gradient Descent

Also known as: GD, stochastic gradient descent, SGD, mini-batch SGD

TL;DR

Gradient descent is the iterative optimization procedure that powers virtually all of deep learning. Compute the gradient of the loss with respect to parameters, take a small step in the opposite direction, repeat.

Gradient descent is the iterative update rule that trains nearly every neural network in production. Given a loss function over parameters , you compute the gradient — a vector pointing in the direction of steepest increase — and take a small step in the opposite direction:

θ ← θ - η ∇L(θ)

is the learning rate — a small positive number controlling step size. Repeat this for billions of steps and the parameters drift toward a region of low loss. That’s the entire algorithm.

SGD vs full-batch

Computing over the entire training set is called full-batch gradient descent. For an LLM with trillions of training tokens, that’s not a thing you can do once, let alone every step. Instead you sample a batch — anywhere from 32 examples to a few million tokens — compute the gradient over just that batch, and update. This is stochastic gradient descent (SGD), and the gradient it produces is a noisy unbiased estimate of the true full-batch gradient.

The noise is a feature, not a bug. It helps the optimizer escape sharp local minima and acts as an implicit regularizer. Models trained with smaller, noisier batches generalize measurably better than ones trained with very large batches — which is why “noise is regularization” became received wisdom.

Why local minima are mostly fine

Two reasons GD doesn’t get stuck where you’d expect.

First, in very high dimensions, the loss surface is dominated by saddle points — places where the loss is a minimum along some directions and a maximum along others. True local minima with no descent direction are rare, and SGD’s noise pushes through saddles in finite time.

Second, the local minima that do exist are mostly equivalent. Models trained from different random seeds end up at loss values within a fraction of a percent of each other, despite reaching nominally different points in parameter space. The loss landscape of a wide deep network is flatter at the bottom than the textbook 1D pictures suggest.

The gradient is the vector of partial derivatives . Each component tells you how much the loss changes per unit change in that one parameter, holding all other parameters fixed.

For a network with billions of parameters, the gradient has billions of components — same shape as itself. Computing it efficiently is the whole point of backpropagation , which uses the chain rule to share intermediate computations across all those partial derivatives in a single backward pass.

The intuition: the gradient points uphill in the steepest direction. Negate it, and you’re heading downhill. Take a step proportional to its magnitude (scaled by ), and you’ve made progress.

Variants you’ll meet

Plain SGD is rarely used directly on modern LLMs. Instead, every production training run wraps GD in an optimizer — Adam, AdamW, Lion — that tracks per-parameter statistics (running mean and variance of the gradient) and uses them to scale each parameter’s update. The core update rule is still , where the “something” gets cleverer.

Go further

Why don't local minima ruin deep-network training?

In high dimensions, almost every critical point is a saddle, not a true local minimum. Local minima do exist but the loss values at most of them are nearly indistinguishable from the global minimum, and SGD's noise easily escapes shallow ones. Empirically, models trained from different seeds reach loss values within fractions of a percent of each other.

Optimizer Learning rate

What's the difference between full-batch GD, SGD, and mini-batch SGD?

Full-batch uses the entire training set per step (clean gradient, infeasible at scale). Pure SGD uses one example (extremely noisy). Mini-batch SGD averages the gradient over a small batch — say 32 to several million examples — and is what everybody actually means when they say 'SGD' in 2026.

Batch size

Where does the gradient itself come from?

From backpropagation: the chain rule applied recursively through the network's computation graph. Modern autodiff frameworks (PyTorch, JAX) construct this graph for you and compute every parameter's gradient in a single backward pass.

Backpropagation

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs