PPO (Proximal Policy Optimization)

Also known as: Proximal Policy Optimization, PPO, clipped policy gradient

TL;DR

A clipped policy-gradient algorithm (Schulman et al., 2017, OpenAI) that keeps each update close to the previous policy by clipping the importance-sampling ratio. It is the classical RL optimizer for RLHF and the algorithm GPT-3.5, early GPT-4, and Llama-2-Chat were aligned with.

PPO — Proximal Policy Optimization (Schulman et al., 2017, OpenAI) — is a clipped policy-gradient algorithm that constrains each update to stay close to the previous policy by clipping the importance-sampling ratio between new and old action probabilities. For LLMs, it is the reinforcement-learning algorithm that fine-tunes a model against a scalar reward signal: the classical RLHF optimizer behind InstructGPT, GPT-3.5, early GPT-4, Claude-1, and Llama-2-Chat.

The clipped objective

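PPO maximizes a pessimistic surrogate of the policy-gradient objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the policy ratio, $\hat{A}_t$ is the advantage at step $t$, and $\epsilon$ is the clip range, typically 0.1–0.3. The $\min$ of clipped and unclipped objectives means PPO never gets rewarded for pushing the ratio above $1+\epsilon$ when the advantage is positive, nor below $1-\epsilon$ when the advantage is negative.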

PPO's central trick is a trust region without computing the trust region. The clip turns the unbounded policy-gradient objective into a pessimistic one that never rewards moving the policy ratio more than $\epsilon$ away from 1, which prevents the destructive overshooting of vanilla policy gradients.
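To make the clipping concrete, here is a minimal pure-Python sketch of the surrogate for a small batch of actions. The function name `clipped_surrogate`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions rather than any library's API; a real implementation would operate on tensors and negate the result to obtain a loss.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate over a batch of actions (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the previous policy. All names here are hypothetical.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance-sampling ratio
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic bound: take the smaller of the two candidate objectives.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, a ratio of 1.5 earns no more than a ratio of 1.2 would. With a negative advantage, a ratio of 1.5 is not clipped and is penalized in full, which is the sense in which the objective is pessimistic.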

The four moving parts of PPO-for-RLHF

Policy: the LLM being aligned, $\pi_\theta$. The actor in the actor-critic split; the only network whose weights ultimately ship.
Value network: same backbone, separate scalar head, estimates $V(s_t)$ for advantage computation. Updated alongside the policy via a regression loss against returns.
Reward model: a separately trained reward model $r_\phi(x, y)$, frozen during PPO; scores complete sequences (terminal reward at the end-of-sequence token).
Reference policy: the SFT model $\pi_{\text{ref}}$, frozen; provides the per-token KL divergence penalty that anchors the policy.
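The value network's regression loss against returns can be sketched as a plain mean squared error. This is a minimal illustration under assumed names (`value_loss`, `value_preds`, `returns`); the factor of 0.5 is a common convention, and practical implementations often add value clipping, which is omitted here.

```python
def value_loss(value_preds, returns):
    """Half mean squared error between predicted values and observed returns.

    Hypothetical helper: value_preds are the critic's outputs V(s_t),
    returns are the regression targets for the same time steps.
    """
    squared_errors = [(v - g) ** 2 for v, g in zip(value_preds, returns)]
    return 0.5 * sum(squared_errors) / len(squared_errors)
```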

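The combined per-token reward is

$$r_t = \mathbb{1}[t = T]\, r_\phi(x, y) - \beta\left(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\text{ref}}(a_t \mid s_t)\right)$$

where $T$ is the index of the final token, $r_\phi(x, y)$ is the reward-model score for prompt $x$ and completion $y$, and $\beta$ is the KL coefficient.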

The reward-model term fires only at the terminal token; the KL term fires at every token. Without the KL anchor, the policy reward-hacks: outputs look like adversarial gibberish that the reward model nonetheless scores high. The KL coefficient $\beta$ is the hyperparameter most teams report as the hardest to tune in the entire RLHF stack: too low gives reward hacking, too high gives no movement off the SFT initialization.
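The reward shaping described above can be sketched in a few lines: the reward-model score is added only at the last token, and a KL penalty, estimated per token as the log-probability difference between policy and reference, is subtracted everywhere. The function name, argument names, and the default `beta=0.05` are assumptions chosen for illustration.

```python
def per_token_rewards(rm_score, logp_policy, logp_ref, beta=0.05):
    """Combine a terminal reward-model score with a per-token KL penalty.

    rm_score: scalar score for the whole completion (hypothetical input).
    logp_policy / logp_ref: per-token log-probs of the sampled tokens under
    the current policy and the frozen reference policy.
    """
    last = len(logp_policy) - 1
    rewards = []
    for t, (lp, lr) in enumerate(zip(logp_policy, logp_ref)):
        kl_term = lp - lr  # single-sample estimate of the per-token KL
        terminal = rm_score if t == last else 0.0
        rewards.append(terminal - beta * kl_term)
    return rewards
```

Tokens where the policy assigns higher probability than the reference receive a negative shaping reward, which is what pulls the policy back toward the SFT model.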

The legacy story

PPO is what RLHF was until 2024. InstructGPT (2022), GPT-3.5, GPT-4 (early), Claude-1, and Llama-2-Chat were all aligned with PPO. The 2024 shift to DPO / IPO / KTO came from operational complexity, not theoretical superiority: PPO has four networks in memory simultaneously, requires online rollouts during training, and is roughly 3× more expensive per training token than DPO. The hyperparameter surface (clip range $\epsilon$, KL coefficient $\beta$, GAE $\lambda$, value loss coefficient, learning rates for actor and critic) is large and badly conditioned. Most teams that succeeded with PPO had dedicated infrastructure and weeks of sweeps.

What PPO still dominates: settings where the reward signal is dense, structured, or programmatically verifiable. Math/code/reasoning training (GRPO, REINFORCE++ and other PPO descendants), tool-use training, and any setting where step-level rewards are available. Constitutional AI originally used a PPO loop against an AI-feedback reward model. The supervision shape — terminal pairwise preference — was what DPO exploited; the moment supervision becomes per-step, PPO returns.

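Generalized Advantage Estimation (Schulman et al., 2016) interpolates between Monte Carlo (high variance, low bias) and one-step TD, i.e. TD(0) (low variance, high bias), advantage estimators via a parameter $\lambda \in [0, 1]$:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual. At $\lambda = 1$ this is the Monte Carlo return minus the baseline; at $\lambda = 0$ it is the one-step TD residual. Typical values are $\gamma \approx 1$, $\lambda \approx 0.95$ for LLM RLHF.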

Without GAE, the policy gradient is unusably high-variance on long sequences: a single trajectory-level reward gets credited equally to every token, drowning the signal in noise. GAE is what makes gradient descent on the policy tractable at LLM scale.
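GAE is usually computed with a single backward pass over the trajectory, accumulating discounted TD residuals. The sketch below assumes a finite episode and a `values` list with one extra bootstrap entry at the end (0.0 for a terminal state); the function name and the defaults `gamma=1.0`, `lam=0.95` are illustrative choices.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation via a backward recursion.

    rewards: per-step rewards r_0 .. r_{T-1}.
    values: value estimates V(s_0) .. V(s_T); len(values) == len(rewards) + 1,
            with the final entry the bootstrap value (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=1.0` every step's advantage equals the Monte Carlo return minus the value baseline; with `lam=0.0` each advantage is just that step's TD residual.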

Concrete failure modes seen in production PPO runs: outputs degenerate into reward-model exploits — specific token patterns the RM happens to score high, sycophantic agreement, repeated phrasings, formatting tics like bullet-point spam, confident-sounding hallucinations. The KL from reference rises sharply while the reward-model score also rises; on held-out human evaluation, quality drops.

The fix is some combination of: a higher KL coefficient $\beta$, a better reward model (more diverse training data, stronger base), entropy regularization to keep the policy from collapsing to a few high-reward modes, gradient clipping to prevent destructive updates, and shorter training (PPO often peaks early and degrades). Usually all of these together. The art of RLHF in 2022–2024 was the art of tuning these knobs in concert; the practical appeal of DPO is that none of them exist.

Go further

Why does PPO need a separate value network?

PPO is an actor-critic algorithm — the actor is the LLM policy, the critic is a value network that estimates expected return. GAE (Generalized Advantage Estimation) uses the value network to drive down the variance of the policy gradient. The value head is usually a one-layer head on the LLM trunk or a separate small network.

What's the role of the KL penalty?

PPO for RLHF adds a per-token KL penalty from a frozen reference policy (the SFT model) to the reward. Without it, the policy drifts to reward-hack — high reward, garbage outputs. The KL coefficient is one of the hardest-to-tune knobs in the whole RLHF stack.

Why did DPO replace PPO for most alignment?

PPO has four moving parts (policy, value, reward model, reference) and many fragile hyperparameters (KL coefficient, clip range, GAE lambda, value loss coefficient). DPO collapses the four into one supervised loss directly on pairwise preferences — same optimum under the Bradley-Terry assumption, dramatically simpler to tune.
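For comparison, the DPO loss on a single preference pair can be written as one logistic loss over sequence log-probabilities. This is a minimal sketch under assumed names; the inputs are summed log-probabilities of the chosen and rejected completions under the trained policy and a frozen reference, and `beta` here is DPO's own temperature, with the default `0.1` chosen only for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of completions.

    Each argument is a sequence-level log-probability (hypothetical inputs).
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # completion than the reference does, relative to the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

No rollouts, value network, or separate reward model appear in the loss; the reference policy enters only through its fixed log-probabilities, which can be precomputed.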
