How is DPO equivalent to RLHF without a reward model?
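Rafailov et al. showed that the optimum of the RLHF objective has a closed-form expression in terms of the policy and reference: $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$, where $Z(x)$ depends only on the prompt. Because $Z(x)$ cancels in any pairwise comparison, the policy can be trained on preferences directly and the reward model never has to be built.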
Also known as: direct preference optimization
DPO is the closed-form alternative to RLHF: optimize the LLM directly on pairwise preferences, with no separate reward model and no reinforcement learning loop. Simpler, more stable, and the default alignment recipe in 2026.
DPO — Direct Preference Optimization (Rafailov et al., 2023) — is the alignment method that displaced RLHF as the production default. It hits the same target (a model that produces preferred outputs) with a single supervised-style training stage and no reinforcement learning.
The supervision shape — pairwise preferences — matters more than the optimizer that consumes it. DPO is the cleaner consumer; RLHF was the first.
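RLHF has the structure: train a reward model $r_\phi$ on preference pairs, then optimize the policy $\pi_\theta$ against $r_\phi$ with reinforcement learning (typically PPO), under a KL penalty that keeps it close to the reference $\pi_{\mathrm{ref}}$.
Solving for the reward in terms of the optimal policy $\pi^*$ gives $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$.
The partition function $Z(x)$ depends only on the prompt $x$, so it cancels whenever two responses to the same prompt are compared.
Given preference data $(x, y_w, y_l)$, with $y_w$ the preferred response and $y_l$ the rejected one, the DPO loss is $\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$.
Read this as: increase the log-probability ratio of the winner (relative to the reference) and decrease it for the loser. The coefficient $\beta$ sets how strongly the policy is held to the reference.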
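A minimal sketch of the per-pair DPO loss, $-\log \sigma(\beta \cdot \text{margin})$, where the margin is the winner's log-probability ratio (policy minus reference) minus the loser's. It uses plain Python and assumes the four inputs are summed sequence log-probabilities that have already been computed; the function names and the example numbers are illustrative, not taken from any library.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    winner_ratio = policy_logp_w - ref_logp_w   # log pi_theta(y_w|x) / pi_ref(y_w|x)
    loser_ratio = policy_logp_l - ref_logp_l    # log pi_theta(y_l|x) / pi_ref(y_l|x)
    margin = winner_ratio - loser_ratio
    return -log_sigmoid(beta * margin)

# Policy identical to the reference: margin is 0, loss is log(2) for any beta.
init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Winner gained 2 nats and loser lost 3 nats relative to the reference:
# margin is 5, so the loss falls below log(2).
moved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(init, moved)
```

In a real training loop the same expression would be averaged over a batch and differentiated with respect to the policy's parameters only; the reference log-probabilities are constants.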
DPO trains offline on a fixed preference set, so it works best when those pairs were sampled from a distribution close to the policy being trained. When the pairs come from a much weaker model, or were sampled once (e.g., from the SFT model) and never refreshed as the policy moves, DPO can underperform a careful PPO setup. Online DPO and iterative DPO variants address this by re-sampling preferences from the current policy mid-training.
The strategic picture: pairwise preferences are the right supervision signal for alignment (as for retrieval). RLHF was the first successful recipe that consumed them; DPO is the cleaner one. The supervision shape is more important than which optimizer you use to consume it.
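The RLHF objective is $\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[r(x, y)\right] - \beta\, D_{\mathrm{KL}}\left(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)$, whose optimum is $\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\left(r(x, y) / \beta\right)$. The normalizer $Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\left(r(x, y) / \beta\right)$ sums over every possible response, so it cannot be computed directly.
The trick is that any pairwise comparison cancels $Z(x)$: under the Bradley-Terry model, $P(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right)$, and the $\beta \log Z(x)$ term appears in both rewards.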
This is why the derivation feels almost too clean: the closed-form optimum of KL-constrained reward maximization happens to have the same Bradley-Terry structure as the preference data, and the partition function, the one term that cannot be computed, happens to depend only on the prompt, so it drops out of every pairwise comparison. The match was discovered, not designed.
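The cancellation can be checked numerically on a toy problem. The sketch below assumes a single prompt with three candidate responses and made-up reference probabilities and rewards. It builds the optimal policy $\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x) \exp(r(x, y)/\beta)$, then confirms that the implicit reward $\beta \log(\pi^* / \pi_{\mathrm{ref}})$ differs from the true reward by the same constant $\beta \log Z(x)$ for every response, so reward differences are preserved exactly.

```python
import math

beta = 0.5

# Toy setup for one prompt x (all numbers are made up for illustration).
ref = {"a": 0.5, "b": 0.3, "c": 0.2}       # pi_ref(y | x)
reward = {"a": 1.0, "b": 2.0, "c": -0.5}   # r(x, y)

# Optimal policy: pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x)
unnorm = {y: ref[y] * math.exp(reward[y] / beta) for y in ref}
Z = sum(unnorm.values())
opt = {y: u / Z for y, u in unnorm.items()}

def implicit_reward(y: str) -> float:
    """beta * log(pi* / pi_ref), which equals r(x, y) - beta * log Z(x)."""
    return beta * math.log(opt[y] / ref[y])

# Every response is off from its true reward by the same constant...
offsets = {y: reward[y] - implicit_reward(y) for y in ref}

# ...so the gap between any two responses is unchanged.
true_gap = reward["b"] - reward["a"]
implicit_gap = implicit_reward("b") - implicit_reward("a")

print(offsets, beta * math.log(Z))
print(true_gap, implicit_gap)
```

With three responses $Z(x)$ is trivial to compute; for a language model the sum runs over all possible sequences, which is why a loss that never needs it matters.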
The β in the DPO loss controls how aggressively the policy can move from the frozen reference. High β (e.g., β = 0.5-1.0) keeps the policy close to the reference and produces conservative, drift-resistant updates; low β (β = 0.01-0.1) lets the policy move freely and can over-fit on the preference dataset.
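One way to see the effect of β is to ask how large the winner-minus-loser log-ratio margin $h$ must be before the per-pair loss $-\log \sigma(\beta h)$ falls to a given level. The sketch below solves for that margin; the target loss of 0.1 and the β values are arbitrary choices for illustration.

```python
import math

def margin_for_loss(target_loss: float, beta: float) -> float:
    """Log-ratio margin h at which -log(sigmoid(beta * h)) equals target_loss."""
    p = math.exp(-target_loss)            # required value of sigmoid(beta * h)
    return math.log(p / (1.0 - p)) / beta

for beta in (0.01, 0.1, 0.5, 1.0):
    h = margin_for_loss(0.1, beta)
    print(f"beta={beta:<4}  margin needed for loss 0.1: {h:8.2f} nats")
```

The required margin scales as 1/β, so β = 0.1 needs ten times the margin that β = 1.0 does. A low β therefore leaves the loss unsatisfied until the policy has moved a long way from the reference on each pair, while a high β saturates the loss after a small move.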
The standard recommendation is β = 0.1, but this is wildly task-dependent. For preference data that’s noisier or smaller, a higher β prevents overfitting to spurious patterns in the labels. For preference data that’s clean and large, a lower β extracts more signal — at the cost of risking reward hacking on whatever surface artifacts the preference labelers used.
Practical shortcut: train at β = 0.1 first, evaluate on held-out preferences plus a separate “general capability” benchmark (MMLU, AlpacaEval). If general capability has dropped — a tell-tale sign that DPO has eaten into the base model — increase β and retrain. This is the cheap version of an iterative DPO loop.
PPO-based RLHF requires four models in memory (policy, reference, reward, value) and a multi-stage pipeline. DPO needs two (policy, frozen reference) and a single stage of fine-tuning on pairwise data, which looks more like supervised learning than reinforcement learning. Memory is roughly half, training is more stable, and hyperparameter tuning is easier.
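The "roughly half" figure follows from counting model copies. The back-of-the-envelope sketch below assumes a hypothetical 7B-parameter model, bf16 weights (2 bytes per parameter), and reward and value models the same size as the policy. It counts weights only; optimizer state, gradients, and activations are left out, and they change the absolute numbers.

```python
def weights_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes) for one model copy."""
    return n_params_billion * bytes_per_param

MODEL_B = 7.0  # hypothetical model size, in billions of parameters

rlhf_models = ["policy", "reference", "reward", "value"]
dpo_models = ["policy", "reference"]

rlhf_gb = len(rlhf_models) * weights_gb(MODEL_B)
dpo_gb = len(dpo_models) * weights_gb(MODEL_B)

print(f"RLHF weights: {rlhf_gb} GB, DPO weights: {dpo_gb} GB, "
      f"ratio: {dpo_gb / rlhf_gb}")
```

In practice the reward and value models are sometimes smaller than the policy or share a backbone, so the saving can be less than half.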
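DPO descendants. IPO replaces the log-sigmoid with a squared loss to curb DPO's tendency to overfit when preferences are nearly deterministic. KTO trains on per-sample (good/bad) labels rather than pairs, which is useful when pairs are unavailable. ORPO collapses SFT and preference optimization into a single stage and drops the reference model. All share DPO's no-reward-model shape and avoid a reinforcement learning loop.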
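To make one of these variants concrete, the sketch below compares the per-pair DPO loss with the IPO loss as given by Azar et al. (2023), $(h - \frac{1}{2\tau})^2$, where $h$ is the same winner-minus-loser log-ratio margin used by DPO. The values of `beta` and `tau` are arbitrary, and the function names are illustrative.

```python
import math

def dpo_pair_loss(h: float, beta: float = 0.1) -> float:
    """-log(sigmoid(beta * h)); keeps decreasing as the margin h grows."""
    return math.log1p(math.exp(-beta * h))

def ipo_pair_loss(h: float, tau: float = 0.1) -> float:
    """Squared distance of the margin h from the fixed target 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2

for h in (0.0, 5.0, 50.0, 500.0):
    print(f"h={h:6.1f}  dpo={dpo_pair_loss(h):.6f}  ipo={ipo_pair_loss(h):.2f}")
```

The DPO loss always rewards a larger margin, so on a pair it can already separate it keeps pushing the policy away from the reference. The IPO loss is minimized at a finite margin and penalizes overshooting it.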