DPO (Direct Preference Optimization)

Also known as: direct preference optimization

TL;DR

DPO is the closed-form alternative to RLHF: optimize the LLM directly on pairwise preferences, with no separate reward model and no reinforcement learning loop. Simpler, more stable, and the default alignment recipe in 2026.

DPO — Direct Preference Optimization (Rafailov et al., 2023) — is the alignment method that displaced RLHF as the production default. It hits the same target (a model that produces preferred outputs) with a single supervised-style training stage and no reinforcement learning.

The supervision shape — pairwise preferences — matters more than the optimizer that consumes it. DPO is the cleaner consumer; RLHF was the first.

The key insight

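RLHF has the structure: train a reward model on preferences, then optimize the policy to maximize reward subject to a KL constraint. The optimum of that objective has a closed form:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta}\, r(x, y)\right)$$

Solving for $r$ in terms of $\pi^*$:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

The $\beta \log Z(x)$ term cancels out in any pairwise comparison. So substituting this expression into the Bradley-Terry preference likelihood gives a loss purely in terms of policy log-probabilities, with no separate reward model needed.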
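The cancellation can be checked numerically on a toy problem. The sketch below assumes a single prompt with three candidate responses; the probabilities, rewards, and function names are made up for illustration. It builds the closed-form optimal policy, recovers the implicit reward from the policy-to-reference log-ratio, and compares pairwise gaps.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Closed-form optimum: pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta)."""
    unnormalized = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(unnormalized)  # partition function Z(x) for this one prompt
    return [u / z for u in unnormalized], z

def implicit_reward(pi, ref, beta):
    """beta * log(pi / pi_ref), which equals the true reward minus beta * log Z(x)."""
    return [beta * math.log(p / q) for p, q in zip(pi, ref)]

# Toy prompt with three candidate responses (illustrative numbers only).
ref_probs = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.5]
beta = 0.1

pi_star, z = optimal_policy(ref_probs, rewards, beta)
r_hat = implicit_reward(pi_star, ref_probs, beta)

# Every implicit reward is off from the true reward by the same constant...
offsets = [r - rh for r, rh in zip(rewards, r_hat)]
# ...so the gap between any two responses matches the true reward gap.
true_gap = rewards[1] - rewards[0]
implicit_gap = r_hat[1] - r_hat[0]

print("offsets:", offsets, "beta*log Z:", beta * math.log(z))
print("true gap:", true_gap, "implicit gap:", implicit_gap)
```

Because the offset is identical for every response to the same prompt, only reward differences survive, and those are all the Bradley-Terry model needs.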

The DPO loss

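Given preference data $\mathcal{D} = \{(x, y_w, y_l)\}$ where $y_w$ is preferred over $y_l$:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Read this as: increase the log-probability ratio of the winner (relative to the reference) and decrease it for the loser. The $\beta$ controls how aggressively the model can move from the reference, playing the role of the KL coefficient in RLHF.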
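As a rough illustration of that reading, here is a minimal pure-Python sketch of the per-pair loss. It assumes each input is the summed token log-probability of a full response under the policy or the frozen reference; the function and argument names are illustrative, not taken from any particular library.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (w = winner, l = loser).

    Each argument is a sequence-level log-probability of one response.
    """
    winner_logratio = policy_logp_w - ref_logp_w  # log pi(y_w|x) / pi_ref(y_w|x)
    loser_logratio = policy_logp_l - ref_logp_l   # log pi(y_l|x) / pi_ref(y_l|x)
    margin = beta * (winner_logratio - loser_logratio)
    return -log_sigmoid(margin)

def batch_dpo_loss(pairs, beta=0.1):
    """Mean loss over an iterable of (policy_w, policy_l, ref_w, ref_l) tuples."""
    pairs = list(pairs)
    return sum(dpo_loss(*p, beta=beta) for p in pairs) / len(pairs)

# Policy identical to the reference: margin is 0, so the loss is log(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the winner and lowered the loser relative to the reference.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(start, improved)
```

In a real training loop the same arithmetic would run on tensors with gradients flowing only through the policy log-probabilities; the reference terms are constants.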

Why this is so much easier

DPO advantages over RLHF

Two models instead of four. Policy and frozen reference. No reward model, no value network.
Looks like supervised learning. A loss over preference pairs, optimized with standard cross-entropy machinery. No PPO, no rollouts, no value baselines.
Stable. DPO training does not blow up the way PPO can, and it is generally much less sensitive to hyperparameters.
Composable with LoRA. You can DPO-tune a LoRA adapter on top of an instruction-tuned base in a few hours on a single GPU.

When DPO underperforms RLHF

DPO trains on a fixed, offline set of preference pairs, so it works best when the compared responses resemble what the policy being trained would actually produce. When the responses come from a much weaker model (e.g., the SFT model, not the partially-aligned policy), DPO can underperform a careful PPO setup. Online DPO and iterative DPO variants address this by re-sampling preferences from the current policy mid-training.
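The re-sampling idea can be sketched as a simple outer loop. Everything here is an assumption for illustration: the three hooks (`sample_pair`, `label_pair`, `dpo_update`) are hypothetical callables the caller would supply, and resetting the reference to the current policy each round is one design choice among several (some variants keep the original reference fixed).

```python
def iterative_dpo(policy, prompts, sample_pair, label_pair, dpo_update, rounds=3):
    """Alternate between collecting on-policy preferences and DPO training.

    Hypothetical hooks supplied by the caller:
      sample_pair(policy, prompt)         -> (response_a, response_b)
      label_pair(prompt, a, b)            -> (winner, loser)
      dpo_update(policy, reference, data) -> updated policy
    """
    for _ in range(rounds):
        reference = policy  # assumption: this round's frozen reference is the current policy
        data = []
        for prompt in prompts:
            a, b = sample_pair(policy, prompt)        # responses from the current policy
            winner, loser = label_pair(prompt, a, b)  # human or judge-model label
            data.append((prompt, winner, loser))
        policy = dpo_update(policy, reference, data)  # one ordinary DPO stage
    return policy
```

Each round is a standard offline DPO run; the only change is that the preference pairs are regenerated from the latest policy before every round.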

Where it fits in the broader alignment shape

The strategic picture: pairwise preferences are the right supervision signal for alignment (as for retrieval). RLHF was the first successful recipe that consumed them; DPO is the cleaner one. The supervision shape is more important than which optimizer you use to consume it.

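The RLHF objective is $\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[r(x, y)\right] - \beta\, \mathrm{KL}\left(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right)$. This is a constrained optimization over a probability distribution; standard variational methods give the closed-form optimum: $\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\left(r(x, y) / \beta\right)$. Solve for $r$ in terms of $\pi^*$ and you get $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$, where $Z(x)$ is the partition function: the per-$x$ normalization constant.

The trick is that any pairwise comparison cancels $Z(x)$. The Bradley-Terry preference probability plugs in the difference of two reward expressions, and $\beta \log Z(x)$ appears in both, subtracting away. What’s left is a loss in terms of policy log-probabilities only: $-\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$.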
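Written out step by step, the cancellation looks like the following. This is a sketch using the standard Bradley-Terry form, where $\sigma$ is the logistic function, $y_w$ and $y_l$ are the preferred and rejected responses, and $Z(x)$ is the partition function for prompt $x$.

```latex
\begin{aligned}
p(y_w \succ y_l \mid x)
  &= \sigma\big(r(x, y_w) - r(x, y_l)\big) \\
  &= \sigma\Big(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x)
       - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log Z(x)\Big) \\
  &= \sigma\Big(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
       - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)
\end{aligned}
```

Replacing $\pi^*$ with the trainable policy $\pi_\theta$ and taking the negative log-likelihood over the preference data gives the DPO loss.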

This is why the derivation feels almost too clean: the closed-form optimum of KL-constrained reward maximization yields a reward whose differences slot directly into the Bradley-Terry model behind the preference data, and the partition function depends only on the prompt, so it drops out of every comparison between two responses to the same prompt. The match was discovered, not designed.

The β in the DPO loss controls how aggressively the policy can move from the frozen reference. High β (e.g., β = 0.5-1.0) keeps the policy close to the reference and produces conservative, drift-resistant updates; low β (β = 0.01-0.1) lets the policy move freely and can over-fit on the preference dataset.
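One way to see why higher β is more conservative: for a single pair, the loss is `-log sigmoid(beta * gap)`, where `gap` is the winner-minus-loser difference in log(policy / reference). The sketch below (function names are illustrative) inverts that expression to ask how large a gap the policy must open up to reach a given loss. The target loss of 0.1 is an arbitrary choice for the demonstration.

```python
import math

def pair_loss(logratio_gap, beta):
    """-log sigmoid(beta * gap) for one preference pair."""
    return math.log1p(math.exp(-beta * logratio_gap))

def gap_needed(target_loss, beta):
    """Log-ratio gap (in nats) required to bring the per-pair loss down to target_loss."""
    # Invert loss = log(1 + exp(-beta * gap)) for gap.
    return -math.log(math.expm1(target_loss)) / beta

for beta in (0.01, 0.1, 0.5, 1.0):
    gap = gap_needed(0.1, beta)
    print(f"beta={beta:<5} gap needed for loss 0.1: {gap:8.1f} nats")
```

The required gap scales as 1/β: with a high β the loss is satisfied after a small departure from the reference, so the gradient fades early, while a low β keeps pushing the policy much further away before the loss flattens out.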

The standard recommendation is β = 0.1, but this is wildly task-dependent. For preference data that’s noisier or smaller, a higher β prevents overfitting to spurious patterns in the labels. For preference data that’s clean and large, a lower β extracts more signal — at the cost of risking reward hacking on whatever surface artifacts the preference labelers used.

Practical shortcut: train at β = 0.1 first, evaluate on held-out preferences plus a separate “general capability” benchmark (MMLU, AlpacaEval). If general capability has dropped — a tell-tale sign that DPO has eaten into the base model — increase β and retrain. This is the cheap version of an iterative DPO loop.
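That shortcut can be written as a small search loop. This is only a sketch: `train_and_eval` is a hypothetical hook that runs one DPO job at a given β and returns held-out preference accuracy plus a general-capability score, and the tolerance (`max_drop`), growth `factor`, and `max_beta` cap are assumed defaults, not recommendations from the text.

```python
def tune_beta(train_and_eval, baseline_capability, start_beta=0.1,
              max_drop=0.01, factor=2.0, max_beta=1.0):
    """Raise beta until general capability stays within max_drop of the baseline.

    train_and_eval(beta) is a hypothetical hook that trains with the given beta
    and returns (held_out_preference_accuracy, capability_score).
    Returns the chosen beta and the full history of runs.
    """
    beta = start_beta
    history = []
    while True:
        pref_acc, capability = train_and_eval(beta)
        history.append((beta, pref_acc, capability))
        capability_drop = baseline_capability - capability
        # Stop once the base model's capability is preserved, or beta hits the cap.
        if capability_drop <= max_drop or beta >= max_beta:
            return beta, history
        beta = min(beta * factor, max_beta)
```

The returned history keeps the preference accuracy for every β tried, so the trade-off between preference fit and capability retention can be inspected rather than decided by the stopping rule alone.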

Go further

How is DPO equivalent to RLHF without a reward model?

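Rafailov et al. showed that the reward under the optimum of the RLHF objective has a closed-form expression in terms of the optimal policy and the reference: $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$. Substituting this into the Bradley-Terry preference model cancels $Z(x)$ and gives a loss that depends only on policy log-probabilities, with no separate reward model needed.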


What's the practical training-time difference?

RLHF requires four models in memory (policy, reference, reward, value) and a multi-stage pipeline. DPO needs two (policy, frozen reference) and a single stage of fine-tuning on pairwise data — looks more like supervised learning than reinforcement learning. Memory is ~half, training is more stable, hyperparameter tuning is easier.


What are IPO, KTO, ORPO?

DPO descendants. IPO replaces DPO's log-sigmoid loss with a regularized squared loss to curb overfitting when preferences are nearly deterministic. KTO trains on per-sample (good/bad) labels rather than pairs, which is useful when pairs are unavailable. ORPO collapses SFT and preference optimization into a single stage. All share DPO's no-reward-model, single-stage shape.

