DPO (Direct Preference Optimization)

Also known as: direct preference optimization

TL;DR

DPO is the closed-form alternative to RLHF: optimize the LLM directly on pairwise preferences, with no separate reward model and no reinforcement learning loop. Simpler, more stable, and the default alignment recipe in 2026.

DPO, short for Direct Preference Optimization (Rafailov et al., 2023), is the alignment method that displaced RLHF as the production default. It hits the same target (a model that produces preferred outputs) with a single supervised-style training stage and no reinforcement learning.

[Figure: Direct Preference Optimization. Same preferences, one stage, no reward model. Both panels start from a prompt x ("Write me a polite refusal.") with a preferred response y_w and a rejected response y_l. Left panel, RLHF (two stages, four models, ×2 memory): policy π, frozen SFT reference π_ref, reward model r_φ fit on the preferences, and value network V_ψ as the PPO baseline, trained with PPO plus a KL penalty. Right panel, DPO (one stage, two models, ½ memory): policy π and frozen reference π_ref, an implicit reward r(x, y), and the loss −log σ(β·[log(π/π_ref)_w − log(π/π_ref)_l]) applied directly to π.]

The supervision shape — pairwise preferences — matters more than the optimizer that consumes it. DPO is the cleaner consumer; RLHF was the first.

The key insight

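RLHF has the structure: train a reward model on preferences, then optimize the policy to maximize reward subject to a KL constraint. The optimum of that objective has a closed form:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

Solving for $r$ in terms of $\pi^*$:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

The $\beta \log Z(x)$ term cancels out in any pairwise comparison. So substituting this expression into the Bradley-Terry preference likelihood gives a loss purely in terms of policy log-probabilities, with no separate reward model needed.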
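The cancellation can be checked numerically. The sketch below is a toy illustration, not part of the original derivation: it assumes a single prompt with three candidate responses, and the reference probabilities, reward values, and β are all made up. It builds the optimal policy from the closed form π* = π_ref · exp(r/β) / Z(x), then compares the true reward gap between two responses with the gap recovered from β · log(π*/π_ref) alone.

```python
import math

beta = 0.1

# Hypothetical toy setup: one prompt x, three candidate responses.
ref_probs = {"a": 0.5, "b": 0.3, "c": 0.2}   # pi_ref(y | x), made-up values
rewards = {"a": 1.0, "b": 0.2, "c": -0.5}    # r(x, y), made-up values

# Closed-form optimum: pi*(y | x) = pi_ref(y | x) * exp(r / beta) / Z(x)
unnormalized = {y: ref_probs[y] * math.exp(rewards[y] / beta) for y in ref_probs}
Z = sum(unnormalized.values())
opt_probs = {y: u / Z for y, u in unnormalized.items()}


def implicit_reward(y):
    """beta * log(pi* / pi_ref): the true reward minus beta * log Z(x)."""
    return beta * math.log(opt_probs[y] / ref_probs[y])


# Each implicit reward is off by the same constant, so the difference matches.
true_gap = rewards["a"] - rewards["b"]
implicit_gap = implicit_reward("a") - implicit_reward("b")
print(true_gap, implicit_gap)
```

The two printed values agree up to floating-point error, even though Z never enters `implicit_reward`. That is the property the Bradley-Terry substitution relies on.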

The DPO loss

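Given preference data $(x, y_w, y_l)$ where $y_w$ is preferred over $y_l$:

$$\mathcal{L}_{\text{DPO}}(\pi; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Read this as: increase the log-probability ratio of the winner (relative to the reference) and decrease it for the loser. The $\beta$ controls how aggressively the model can move from the reference, playing the role of the KL coefficient in RLHF.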
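As a minimal sketch of the loss for a single preference triple, the function below takes four summed sequence log-probabilities (winner and loser, each under the policy and under the frozen reference) and returns −log σ(β · margin). The function names and the example log-probabilities are illustrative assumptions. A real training loop would compute these values from model forward passes and average over a batch.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple.

    Each argument is a summed sequence log-probability log pi(y | x).
    """
    winner_ratio = policy_logp_w - ref_logp_w   # log(pi / pi_ref) for y_w
    loser_ratio = policy_logp_l - ref_logp_l    # log(pi / pi_ref) for y_l
    margin = beta * (winner_ratio - loser_ratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: the margin is 0, so the loss is log 2.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted mass toward the winner and away from the loser.
loss_after_shift = dpo_loss(-10.0, -17.0, -12.0, -15.0)

print(loss_at_init, loss_after_shift)
```

Only the reference-relative ratios matter: the absolute log-probabilities of the two responses drop out, which is why the loss starts at log 2 regardless of how likely either response is under the reference.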

Why this is so much easier

DPO advantages over RLHF
  • Two models instead of four. Policy and frozen reference. No reward model, no value network.
  • Looks like supervised learning. A loss over preference pairs, optimized with standard cross-entropy machinery. No PPO, no rollouts, no value baselines.
  • Stable. DPO training is far less prone to divergence than PPO, and it has fewer hyperparameters to tune (mainly β and the learning rate).
  • Composable with LoRA. You can DPO-tune a LoRA adapter on top of an instruction-tuned base in a few hours on a single GPU.

When DPO underperforms RLHF

The closed-form derivation assumes preferences are sampled from the same policy you’re training. When preferences come from a much weaker model (e.g., the SFT model, not the partially-aligned policy), DPO can underfit relative to a careful PPO setup. Online DPO and iterative DPO variants address this by re-sampling preferences from the current policy mid-training.
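The re-sampling idea can be sketched as a small outer loop. Everything here is a hypothetical skeleton: `sample_two`, `pick_winner`, and `dpo_train` are placeholder callables the caller would supply (a generation routine, a human or judge labeler, and an ordinary offline DPO pass), and resetting the reference to the latest policy each round is one possible choice, labeled as an assumption. Published variants differ on that point.

```python
from typing import Callable, List, Tuple, TypeVar

Policy = TypeVar("Policy")
Pair = Tuple[str, str, str]  # (prompt, preferred response, rejected response)


def iterative_dpo(
    policy: Policy,
    prompts: List[str],
    sample_two: Callable[[Policy, str], Tuple[str, str]],
    pick_winner: Callable[[str, str, str], Tuple[str, str]],
    dpo_train: Callable[[Policy, Policy, List[Pair]], Policy],
    rounds: int = 3,
) -> Policy:
    """Alternate between sampling fresh pairs from the current policy and DPO."""
    for _ in range(rounds):
        # Assumption: re-anchor the reference on the latest policy each round.
        reference = policy
        pairs: List[Pair] = []
        for x in prompts:
            y_a, y_b = sample_two(policy, x)      # on-policy samples
            y_w, y_l = pick_winner(x, y_a, y_b)   # preference label
            pairs.append((x, y_w, y_l))
        # One ordinary offline DPO pass on the freshly labeled pairs.
        policy = dpo_train(policy, reference, pairs)
    return policy
```

The point of the structure is that every round's preference pairs come from the policy as it currently stands, so the training data tracks the distribution being optimized.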

Where it fits in the broader alignment shape

The strategic picture: pairwise preferences are the right supervision signal for alignment. RLHF was the first successful recipe that consumed them; DPO is the cleaner one. The supervision shape is more important than which optimizer you use to consume it.

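The RLHF objective is $\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[r(x, y)\right] - \beta\, \mathrm{KL}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\right)$. This is a constrained optimization over a probability distribution; standard variational methods give the closed-form optimum: $\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\!\left(r(x, y)/\beta\right)$. Solve for $r$ in terms of $\pi^*$ and you get $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$, where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\!\left(r(x, y)/\beta\right)$ is the partition function, the per-$x$ normalization constant.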

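The trick is that any pairwise comparison cancels $Z(x)$. The Bradley-Terry preference probability $p(y_w \succ y_l \mid x) = \sigma\!\left(r(x, y_w) - r(x, y_l)\right)$ plugs in the difference of two reward expressions, and $\beta \log Z(x)$ appears in both, subtracting away. What’s left is a loss in terms of policy log-probabilities only: $-\log \sigma\!\left(\beta\left[\log \frac{\pi(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)$.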

This is why the derivation feels almost too clean: the closed-form optimum of KL-constrained reward maximization yields a reward whose pairwise differences slot directly into the Bradley-Terry model, and the partition function depends only on the prompt, so it drops out of every comparison. The match was discovered, not designed.

The β in the DPO loss controls how aggressively the policy can move from the frozen reference. High β (e.g., β = 0.5-1.0) keeps the policy close to the reference and produces conservative, drift-resistant updates; low β (β = 0.01-0.1) lets the policy move freely and can over-fit on the preference dataset.
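One way to see this effect is to ask how large the log-ratio gap, log(π/π_ref) for the winner minus the same quantity for the loser, must become before the per-pair loss falls to a given level. The sketch below inverts −log σ(β · gap) for that gap. The target loss of 0.1 and the β grid are arbitrary illustrative choices.

```python
import math


def dpo_loss_from_gap(log_ratio_gap, beta):
    """-log sigmoid(beta * gap), with gap = log(pi/pi_ref)_w - log(pi/pi_ref)_l."""
    return math.log1p(math.exp(-beta * log_ratio_gap))


def gap_needed(target_loss, beta):
    """Log-ratio gap at which the per-pair loss equals target_loss."""
    # Invert loss = log(1 + exp(-beta * gap)) for gap.
    return -math.log(math.expm1(target_loss)) / beta


for beta in (0.01, 0.1, 0.5, 1.0):
    gap = gap_needed(0.1, beta)
    print(f"beta={beta}: gap of {gap:.1f} nats brings the loss to 0.1")
```

The required gap scales as 1/β, so a tenfold smaller β asks the policy to move ten times further from the reference before the loss stops pushing. That matches the description of low β as the setting that lets the policy drift.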

The standard recommendation is β = 0.1, but this is wildly task-dependent. For preference data that’s noisier or smaller, a higher β prevents overfitting to spurious patterns in the labels. For preference data that’s clean and large, a lower β extracts more signal — at the cost of risking reward hacking on whatever surface artifacts the preference labelers used.

Practical shortcut: train at β = 0.1 first, evaluate on held-out preferences plus a separate “general capability” benchmark (MMLU, AlpacaEval). If general capability has dropped — a tell-tale sign that DPO has eaten into the base model — increase β and retrain. This is the cheap version of an iterative DPO loop.
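The shortcut above can be sketched as a small selection loop. This is a hypothetical helper, not an established API: `train_and_eval` stands in for whatever routine trains at a given β and returns a held-out preference accuracy plus a general-capability score, and the β schedule and `max_drop` tolerance are made-up defaults.

```python
from typing import Callable, Iterable, Optional, Tuple


def pick_beta(
    train_and_eval: Callable[[float], Tuple[float, float]],
    baseline_capability: float,
    betas: Iterable[float] = (0.1, 0.2, 0.5),
    max_drop: float = 0.01,
) -> Optional[Tuple[float, float, float]]:
    """Return (beta, pref_accuracy, capability) for the first acceptable beta.

    A beta is acceptable when the capability score falls no more than
    max_drop below the pre-DPO baseline. Returns None if none qualifies.
    """
    for beta in betas:
        pref_accuracy, capability = train_and_eval(beta)
        if baseline_capability - capability <= max_drop:
            return beta, pref_accuracy, capability
    return None
```

Trying β values in increasing order means the loop stops at the least conservative setting that keeps general capability intact, which keeps as much of the preference signal as the capability constraint allows.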

Go further

How is DPO equivalent to RLHF without a reward model?

Rafailov et al. showed that, at the optimum of the RLHF objective, the reward has a closed-form expression in terms of the policy and reference: $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$. Substituting this into the Bradley-Terry preference model cancels $Z(x)$ and gives a loss that depends only on policy log-probabilities, with no separate reward model needed.

What's the practical training-time difference?

RLHF requires four models in memory (policy, reference, reward, value) and a multi-stage pipeline. DPO needs two (policy, frozen reference) and a single stage of fine-tuning on pairwise data — looks more like supervised learning than reinforcement learning. Memory is ~half, training is more stable, hyperparameter tuning is easier.
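A back-of-envelope version of the memory comparison, under loudly stated assumptions: all models are the same size, training is full fine-tuning with mixed-precision Adam at roughly 16 bytes per trainable parameter (2 for weights, 2 for gradients, 12 for fp32 master weights and two moment buffers), frozen models cost 2 bytes per parameter, and activations are ignored. The 7B size is a hypothetical example. Real footprints depend on sharding, quantization, and whether LoRA is used.

```python
def training_memory_gb(n_params, trainable_models, frozen_models,
                       bytes_trainable=16, bytes_frozen=2):
    """Rough weight + gradient + optimizer memory in GB (activations ignored)."""
    total_bytes = n_params * (trainable_models * bytes_trainable
                              + frozen_models * bytes_frozen)
    return total_bytes / 1e9


n = 7e9  # hypothetical: every model in the pipeline has 7B parameters

# RLHF with PPO: policy and value network are trained; reference and reward are frozen.
rlhf_gb = training_memory_gb(n, trainable_models=2, frozen_models=2)

# DPO: only the policy is trained; the reference is frozen.
dpo_gb = training_memory_gb(n, trainable_models=1, frozen_models=1)

print(rlhf_gb, dpo_gb, dpo_gb / rlhf_gb)
```

Under these assumptions the ratio comes out at one half, because DPO drops exactly one trainable and one frozen model from the PPO setup.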

What are IPO, KTO, ORPO?

DPO descendants. IPO swaps the log-sigmoid for a squared loss to curb overfitting when preferences are near-deterministic. KTO trains on per-sample (good/bad) labels rather than pairs, which is useful when pairs are unavailable. ORPO collapses SFT and preference optimization into a single stage. All share DPO's no-reward-model, single-stage shape.
