What's the loss function for a reward model?
Bradley-Terry loss on pairwise preferences: given
Also known as: reward model, RM training, preference model
Training a model that predicts a scalar quality or preference score for an LLM's output. The backbone of RLHF — the reward model is what the LLM optimizes against.
A reward model
A reranker is a reward model for relevance. Same architecture, same Bradley-Terry training shape, same calibration concerns — different domain.
Standard recipe: pairwise preference data. For prompt
The reward model outputs scalars; the difference of scalars predicts the pairwise preference probability. This is the same statistical shape as Elo ratings in chess, and a close relative of the Thurstone model, which uses a normal rather than a logistic link. The reward model never has to predict an absolute score, only relative ones, which is why pairwise data is the right supervision shape.
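As a minimal sketch of the pairwise objective described above, the following computes the Bradley-Terry preference probability and its negative log-likelihood for a single (chosen, rejected) pair. The function names are illustrative, not taken from any library, and the scores are assumed to be plain floats already produced by a reward model.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """P(chosen beats rejected) under Bradley-Terry: sigmoid of the score gap."""
    return sigmoid(r_chosen - r_rejected)

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference, -log(sigmoid(d))."""
    d = r_chosen - r_rejected
    # -log(sigmoid(d)) equals softplus(-d) = log(1 + exp(-d)), written stably.
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
```

Only the gap `r_chosen - r_rejected` enters either function, which is the sense in which the model is trained on relative rather than absolute scores.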
Almost always a transformer with a scalar output head replacing the language-modeling head. Common starting points: the SFT model itself (so the reward model has the same world model as the policy), or a separately-trained classifier-style transformer. Initialization from a strong base matters — RM quality is bounded by the base model’s understanding of the task.
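To make the "scalar output head" concrete, here is a framework-free sketch of one common design, stated as an assumption rather than a universal rule: take the hidden state of the last non-padding token and project it to a single number with a linear layer. The names `hidden_states`, `attention_mask`, `head_weight`, and `head_bias` are hypothetical, and right padding is assumed.

```python
from typing import List

def scalar_reward(hidden_states: List[List[float]],
                  attention_mask: List[int],
                  head_weight: List[float],
                  head_bias: float = 0.0) -> float:
    """Project the final non-padding token's hidden state to one scalar."""
    # Index of the last real (non-padding) token, assuming right padding.
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    h = hidden_states[last]
    # Linear head: dot product with the weight vector, plus a bias.
    return sum(w * x for w, x in zip(head_weight, h)) + head_bias
```

In a real model the hidden states come from the transformer backbone and the head is trained jointly with it; this sketch only shows the shape of the final projection.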
Reward models trained with pairwise loss are constrained only on differences of scores. The absolute level is arbitrary: adding the same constant to every reward leaves the pairwise loss unchanged, so nothing ties an output of 0.7 to any particular quality level. This is fine for ranking but breaks any threshold-based decision (“trust the reward if greater than 0.7”).
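A quick numerical check of the point that the pairwise loss sees only score differences: shifting every score by the same constant leaves the loss unchanged. The scores below are made-up values for illustration.

```python
import math

def pairwise_loss(scores_chosen, scores_rejected):
    """Mean Bradley-Terry negative log-likelihood over a batch of pairs."""
    total = 0.0
    for c, r in zip(scores_chosen, scores_rejected):
        d = c - r
        total += max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
    return total / len(scores_chosen)

chosen = [1.2, 0.3, 2.0]
rejected = [0.4, 0.9, -1.0]

base = pairwise_loss(chosen, rejected)
# Add 50 to every score: the gaps, and therefore the loss, are unchanged.
shifted = pairwise_loss([c + 50.0 for c in chosen],
                        [r + 50.0 for r in rejected])
print(math.isclose(base, shifted, rel_tol=1e-9))
```

Because two models that differ only by such an offset are indistinguishable to the loss, a fixed threshold on the raw score has no stable meaning.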
This is exactly the same calibration challenge faced by rerankers, and the fix is the same: pass pairwise preferences through a Thurstone fit to get continuous targets on a fixed scale, then train pointwise against those. zELO’s recipe applied to reward modeling produces calibrated reward models.
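The Thurstone step can be sketched as follows, under stated assumptions: a Thurstone Case V model in which P(i beats j) = Φ(s_i − s_j), fitted by plain gradient ascent with a small ridge penalty and a sum-to-zero constraint to fix the offset. This is an illustration of the general idea, not zELO's actual pipeline; `thurstone_fit` and its hyperparameters are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, provides cdf() and pdf()

def thurstone_fit(n_items, wins, steps=500, lr=0.05, l2=0.01):
    """Fit latent scores so that P(i beats j) = Phi(s_i - s_j).

    wins: list of (winner_index, loser_index) pairs.
    Returns one score per item, centred so the scores sum to zero.
    """
    s = [0.0] * n_items
    for _ in range(steps):
        # Small ridge term keeps scores finite for items that never lose.
        grad = [-l2 * v for v in s]
        for w, l in wins:
            d = s[w] - s[l]
            p = max(_STD_NORMAL.cdf(d), 1e-12)   # guard against division by zero
            g = _STD_NORMAL.pdf(d) / p           # derivative of log Phi(d)
            grad[w] += g
            grad[l] -= g
        s = [v + lr * g for v, g in zip(s, grad)]
        mean = sum(s) / n_items
        s = [v - mean for v in s]                # pin the offset
    return s
```

The fitted scores live on a fixed scale determined by the data and the constraint, so they can serve as continuous regression targets for a pointwise model.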
DPO showed that, for the specific purpose of LLM alignment, the closed-form RLHF optimum can be expressed directly in terms of the policy and the reference model. No separate reward model is needed; the policy's log-probability ratio against the reference acts as an implicit reward, up to a scale factor and a prompt-dependent constant. This collapses the RLHF pipeline by a stage and removes the separately trained reward model as a target for over-optimization, although the policy can still overfit the preference data itself.
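A minimal sketch of the DPO objective for one preference pair, assuming the summed log-probabilities of each response under the policy and the frozen reference are already available as floats. The argument names and the default `beta` are illustrative.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss: Bradley-Terry loss on implicit rewards beta * log(pi / pi_ref)."""
    # Implicit reward of each response: scaled log-ratio against the reference.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    d = r_chosen - r_rejected
    # -log(sigmoid(d)), written as a stable softplus(-d).
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
```

When the policy equals the reference, both implicit rewards are zero and the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.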
For alignment, this is a clean win. But reward modeling lives on in:
A standard reward model scores the final answer:
The motivation is failure-localization. A wrong final answer might come from a single bad step in a 10-step chain — the other 9 were correct, but the chain compounds the error. A PRM lets you (a) train the policy to assign credit per step, not per outcome, and (b) at inference, prune chains as soon as a low-quality step appears, saving compute on dead ends.
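The inference-time pruning in (b) might look like the sketch below. `step_scorer` is a hypothetical stand-in for a process reward model that scores the latest step given the steps before it, and `min_step_score` is an assumed threshold, which only makes sense if the scorer's outputs are calibrated.

```python
from typing import Callable, List, Optional

def score_chain(steps: List[str],
                step_scorer: Callable[[List[str]], float],
                min_step_score: float = 0.5) -> Optional[List[float]]:
    """Score a reasoning chain step by step.

    Returns the per-step scores, or None as soon as one step scores
    below min_step_score (the chain is pruned at that point).
    """
    scores: List[float] = []
    for i in range(len(steps)):
        s = step_scorer(steps[: i + 1])   # score step i given its prefix
        if s < min_step_score:
            return None                   # stop spending compute on this chain
        scores.append(s)
    return scores
```

Later steps of a pruned chain are never scored, which is where the compute saving comes from.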
Process reward models are a widely studied supervision signal for multi-step reasoning, though not every reasoning model relies on them: the DeepSeek-R1 report, for example, describes rule-based outcome rewards rather than a PRM. Training data is expensive, since you need step-level annotations, often obtained via tree search and rollouts, but the reported gains on multi-step math and code can be substantial.
The Bradley-Terry loss is informative for training but uninterpretable in absolute terms — a “good” loss depends on the difficulty of the pairs, the calibration of the model, and the noise floor of the labels. So benchmarks evaluate by pairwise accuracy: across held-out (chosen, rejected) pairs, how often does the RM score chosen higher than rejected?
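Pairwise accuracy is simple enough to state in a few lines. In this sketch `score` is a hypothetical callable wrapping the reward model, and ties are counted as errors, which is one reasonable convention among several.

```python
def pairwise_accuracy(pairs, score):
    """Fraction of held-out pairs where the chosen response outscores the rejected one.

    pairs: iterable of (prompt, chosen, rejected) tuples.
    score: callable (prompt, response) -> float.
    """
    pairs = list(pairs)
    # Strict inequality: a tie counts as a miss.
    correct = sum(1 for p, c, r in pairs if score(p, c) > score(p, r))
    return correct / len(pairs)
```

Because the metric depends only on which score is larger, it says nothing about whether the scores are calibrated.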
The standard suite is RewardBench (Lambert et al., 2024), which breaks accuracy down by category: chat, chat-hard, safety, and reasoning (which covers code and math). Strong RMs hit 90%+ on chat and 75-85% on reasoning; the gap reflects how much harder it is to discriminate quality on multi-step problems. Calibration is a separate axis that almost no public benchmark measures well, which is why most off-the-shelf reward models are rank-correct but not calibrated, and why production stacks that need thresholds (rejection sampling, refuse-or-answer routing) retrain their own RMs.
The standard benchmark is RewardBench (Lambert et al., 2024): pairwise accuracy across chat, chat-hard, safety, and reasoning (including code) categories. Strong RMs hit 80-90%, weaker ones around 70%. Calibration is a separate axis, evaluated rarely and harder to get right than rank accuracy, which is why production stacks that need score thresholds (rejection sampling, refuse-or-answer routing) train their own RMs against Thurstone-recovered targets rather than reusing public ones.
A reward model is an imperfect proxy for true human preference. Optimize against it hard enough and you find outputs the RM rates highly that humans actually wouldn't (Goodhart's law in action). The KL penalty in RLHF is meant to limit how far the policy can drift from the reference; in practice, careful reward-model evaluation and conservative training are the real defenses.
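The KL penalty mentioned above is often applied by subtracting a scaled log-probability ratio from the reward model's score. The sketch below assumes a sequence-level formulation with a single-sample KL estimate; `beta` and the function name are illustrative, and real implementations frequently apply the penalty per token instead.

```python
def kl_shaped_reward(rm_score: float,
                     logp_policy: float,
                     logp_ref: float,
                     beta: float = 0.05) -> float:
    """Reward model score minus a KL penalty for drifting from the reference."""
    # log pi(y|x) - log pi_ref(y|x) for the sampled response y is a
    # single-sample estimate of KL(policy || reference).
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate
```

A response the policy has made much more likely than the reference would earns less shaped reward, which is what discourages drifting toward outputs that merely exploit the RM.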
A reranker is a reward model for relevance — same shape, different domain. zELO's pipeline trains a pointwise reranker against Thurstone-recovered targets, structurally identical to how a calibrated reward model is trained from pairwise preferences. The retrieval and alignment communities reinvented the same machinery in parallel.