Reward Modeling

Also known as: reward model, RM training, preference model

TL;DR

Training a model that predicts a scalar quality or preference score for an LLM's output. The backbone of RLHF — the reward model is what the LLM optimizes against.

A reward model is a neural network, typically a transformer with a scalar output head, trained to predict how good response $y$ is for prompt $x$. In RLHF, the reward model is what the LLM optimizes against; in retrieval, the reranker plays the same role at inference time.

[Figure: Reward modeling. Pairs in, calibrated scalar out. Preference triples (x, y_w, y_l) pass through the reward model (transformer blocks plus a scalar head) to give scalar scores r_φ(x, y) on a shared score axis. The four example pairs have margins Δr = 1.73, 3.91, 1.16, and 1.63 under the Bradley-Terry loss ℒ = −log σ(Δr); pairwise accuracy is 4/4 (100%).]

A reranker is a reward model for relevance. Same architecture, same Bradley-Terry training shape, same calibration concerns — different domain.

How it’s trained

Standard recipe: pairwise preference data. For prompt $x$, sample two responses $y_w$ (preferred) and $y_l$ (not). The Bradley-Terry loss:

$$\mathcal{L}(\phi) = -\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

The reward model outputs scalars; the difference of scalars predicts the pairwise preference probability, $P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$. This is the same statistical shape as the rating model behind chess Elo. The reward model never has to predict an absolute score, only relative ones, which is why pairwise data is the right supervision shape.
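A minimal sketch of the Bradley-Terry loss for a single pair, using only the standard library. The function names and the example scores are illustrative assumptions, not part of any particular training stack.

```python
import math


def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    delta = r_chosen - r_rejected
    # Branch on the sign so math.exp never overflows for large |delta|.
    if delta >= 0:
        return 1.0 / (1.0 + math.exp(-delta))
    e = math.exp(delta)
    return e / (1.0 + e)


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(delta), computed as a numerically stable softplus(-delta)."""
    delta = r_chosen - r_rejected
    return max(-delta, 0.0) + math.log1p(math.exp(-abs(delta)))


# Toy scores (assumed): the chosen response is scored higher than the rejected one.
loss_correct = bradley_terry_loss(1.42, -0.31)   # small loss, ordering is right
loss_flipped = bradley_terry_loss(-0.31, 1.42)   # large loss, ordering is wrong
loss_tied = bradley_terry_loss(0.0, 0.0)         # equals log(2): a coin flip
```

Only the difference of the two scores enters the loss, so the gradient pushes the chosen score up and the rejected score down by the same amount.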

Architecture

Almost always a transformer with a scalar output head replacing the language-modeling head. Common starting points: the SFT model itself (so the reward model has the same world model as the policy), or a separately-trained classifier-style transformer. Initialization from a strong base matters — RM quality is bounded by the base model’s understanding of the task.
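As a rough illustration of what "a scalar output head" means, the sketch below projects the final token's hidden state to one number. Last-token pooling is an assumption here (it is a common choice for decoder-only reward models, but not the only one), and the hidden states are toy values standing in for a real transformer's output.

```python
from typing import List


def scalar_reward(
    hidden_states: List[List[float]],  # one hidden vector per token (toy stand-in)
    head_weights: List[float],         # learned weights of the scalar head
    head_bias: float = 0.0,
) -> float:
    """Project the last token's hidden state to a single scalar reward."""
    last_token = hidden_states[-1]  # assumed pooling: take the final position
    if len(last_token) != len(head_weights):
        raise ValueError("hidden size and head size must match")
    return sum(h * w for h, w in zip(last_token, head_weights)) + head_bias


# Two tokens, hidden size 2; only the last token feeds the head.
example_reward = scalar_reward([[0.0, 0.0], [1.0, 2.0]], [0.5, -0.25], head_bias=0.1)
```

The language-modeling head maps each hidden vector to a vocabulary-sized logit vector; the reward head replaces that with a projection to a single output.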

The calibration problem

Reward models trained with pairwise loss are only constrained on differences of scores. The absolute level is arbitrary: adding the same constant to every reward leaves the pairwise loss unchanged, so two equally good models can center their scores anywhere, and the spread of scores depends on the training data and regularization rather than on any external standard. This is fine for ranking but breaks any threshold-based decision (“trust the reward if greater than 0.7”).
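A small check of why thresholds are fragile, under toy assumptions: shifting every score by a constant leaves the pairwise loss unchanged while flipping every fixed-threshold decision. The scores and the 0.7 threshold are illustrative values.

```python
import math


def pairwise_loss(r_w: float, r_l: float) -> float:
    """Bradley-Terry loss -log sigmoid(r_w - r_l), in stable softplus form."""
    delta = r_w - r_l
    return max(-delta, 0.0) + math.log1p(math.exp(-abs(delta)))


def mean_loss(pairs, shift: float = 0.0) -> float:
    """Average pairwise loss after adding `shift` to every score."""
    return sum(pairwise_loss(w + shift, l + shift) for w, l in pairs) / len(pairs)


# Toy (chosen, rejected) scores, assumed for illustration.
pairs = [(1.42, -0.31), (2.05, -1.86), (0.94, -0.22), (1.18, -0.45)]

base = mean_loss(pairs)
shifted = mean_loss(pairs, shift=-5.0)  # same ranking, same loss up to rounding

# A fixed threshold behaves very differently under the two parameterizations.
threshold = 0.7
above_before = [w > threshold for w, _ in pairs]        # every chosen response passes
above_after = [w - 5.0 > threshold for w, _ in pairs]   # none pass after the shift
```

Both score sets are equally good by the training objective, so the objective alone cannot tell you which one makes the threshold meaningful.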

This is exactly the same calibration problem faced by rerankers, and the fix is the same: pass pairwise preferences through a Thurstone model to get continuous targets on a fixed scale, then train pointwise against those. zELO’s recipe applied to reward modeling produces calibrated reward models.
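One way to picture the "pairwise preferences in, continuous targets out" step is the classical Thurstone Case V least-squares solution, sketched below. This is a textbook estimator chosen for brevity, not a description of zELO's actual pipeline; the win-rate matrix, the clipping constant `eps`, and the function name are all assumptions.

```python
from statistics import NormalDist
from typing import List


def thurstone_scores(win_rate: List[List[float]], eps: float = 0.01) -> List[float]:
    """Recover latent scores from a matrix of pairwise win rates.

    win_rate[i][j] is the fraction of comparisons in which item i was
    preferred over item j. Each rate is mapped through the inverse normal
    CDF and averaged per row (diagonal entries count as zero).
    """
    normal = NormalDist()
    n = len(win_rate)
    scores = []
    for i in range(n):
        z_sum = 0.0
        for j in range(n):
            if i == j:
                continue
            # Clip so inv_cdf never sees exactly 0 or 1.
            p = min(max(win_rate[i][j], eps), 1.0 - eps)
            z_sum += normal.inv_cdf(p)
        scores.append(z_sum / n)
    return scores


# Toy win rates for three responses A, B, C (assumed data).
win_rate = [
    [0.5, 0.8, 0.9],  # A beats B 80% of the time and C 90%
    [0.2, 0.5, 0.7],  # B beats C 70% of the time
    [0.1, 0.3, 0.5],
]
targets = thurstone_scores(win_rate)  # continuous, zero-centered targets
```

The recovered scores sum to zero and sit on a shared scale, so a pointwise model can regress against them directly instead of learning only from score differences.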

Why DPO obviated reward modeling for alignment

DPO (Direct Preference Optimization) showed that, for the specific purpose of LLM alignment, the closed-form RLHF optimum can be computed directly from the policy and reference. No separate reward model is needed; the policy implicitly is the reward model up to a constant. This collapses the RLHF pipeline by a stage and removes the failure mode of over-optimizing a separately trained reward model (you can’t over-optimize a reward signal that doesn’t exist as a separate model).
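The "implicit reward" claim can be written out using the standard DPO formulation. The symbols $\beta$ (KL coefficient), $\pi_\theta$ (policy), $\pi_{\mathrm{ref}}$ (reference policy), and $Z(x)$ (per-prompt normalizer) are notation introduced here, not taken from the text above.

```latex
% Reward implied by a policy, up to a term that depends only on the prompt x:
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substituting into the Bradley-Terry loss, the Z(x) term cancels in the difference:
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
```

Because the normalizer cancels, the loss depends only on log-probability ratios that the policy and reference already provide, which is why no separate scalar head has to be trained.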

For alignment, this is a clean win. But reward modeling lives on in:

Where reward models still matter
  • Rerankers. zerank-2 is structurally a calibrated reward model for (query, document) pairs.
  • LLM-as-judge systems. Lightweight reward models in disguise — score a candidate against a rubric.
  • Iterative RLAIF. Scoring many candidates and picking top-k (rejection sampling, best-of-N) needs an explicit RM. DPO cannot help.
  • Process reward models. Step-by-step reasoning evaluation — score each chain-of-thought step, not just the final answer. Used in modern math/code training.
  • Safety classifiers. Toxicity, jailbreak detection, and policy-violation scoring are reward models with binary or low-dimensional output.
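The best-of-N case from the list above can be sketched in a few lines. `reward_fn` is a hypothetical stand-in for an explicit reward model, and the candidate strings and scores are toy values.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    candidates: List[str],
    reward_fn: Callable[[str, str], float],  # hypothetical scorer: (prompt, response) -> reward
    k: int = 1,
) -> List[Tuple[float, str]]:
    """Score every candidate with the reward model and keep the top k."""
    scored = [(reward_fn(prompt, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy reward table standing in for a trained model (assumed values).
toy_scores = {"408": 1.2, "Around 400, I think.": -0.4, "I don't know.": -1.0}
top = best_of_n(
    "What is 17 x 24?",
    list(toy_scores),
    lambda prompt, response: toy_scores[response],
    k=2,
)
```

The selection step needs a score for every candidate, which is exactly what an explicit reward model provides.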

A standard reward model scores the final answer: $r_\phi(x, y)$. A process reward model (PRM) scores each step of a reasoning chain, producing one score per intermediate step.

The motivation is failure-localization. A wrong final answer might come from a single bad step in a 10-step chain — the other 9 were correct, but the chain compounds the error. A PRM lets you (a) train the policy to assign credit per step, not per outcome, and (b) at inference, prune chains as soon as a low-quality step appears, saving compute on dead ends.
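Point (b), pruning at inference time, might look like the sketch below. `step_reward_fn` is a hypothetical per-step scorer that sees the steps so far plus the new step; the threshold and the toy scores are assumptions.

```python
from typing import Callable, List, Optional, Tuple


def prune_chain(
    steps: List[str],
    step_reward_fn: Callable[[List[str], str], float],  # hypothetical PRM: (previous steps, step) -> score
    threshold: float,
) -> Tuple[List[Tuple[str, float]], Optional[int]]:
    """Walk a reasoning chain and stop at the first step scored below threshold.

    Returns the accepted (step, score) pairs and the index of the failing
    step, or None if every step passed.
    """
    kept: List[Tuple[str, float]] = []
    for i, step in enumerate(steps):
        score = step_reward_fn(steps[:i], step)
        if score < threshold:
            return kept, i  # remaining steps are never scored or generated
        kept.append((step, score))
    return kept, None


# Toy per-step scores standing in for a trained PRM (assumed values).
toy_step_scores = {"17 * 20 = 340": 0.9, "17 * 4 = 78": 0.2, "340 + 78 = 418": 0.8}
kept, failed_at = prune_chain(
    list(toy_step_scores),
    lambda previous, step: toy_step_scores[step],
    threshold=0.5,
)
```

Here the chain is cut at the second step, so the third step, which is arithmetically valid but built on a wrong intermediate value, is never evaluated.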

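Process reward models are a widely studied supervision signal for multi-step reasoning models, although public reports differ on how much o1-style and DeepSeek-R1-style systems depend on them (the DeepSeek-R1 report describes rule-based outcome rewards rather than a PRM). Training data is expensive, since you need step-level annotations, often obtained via tree search and rollouts, but step-level supervision can make policies markedly better at multi-step math and code.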

The Bradley-Terry loss is informative for training but uninterpretable in absolute terms — a “good” loss depends on the difficulty of the pairs, the calibration of the model, and the noise floor of the labels. So benchmarks evaluate by pairwise accuracy: across held-out (chosen, rejected) pairs, how often does the RM score chosen higher than rejected?
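Pairwise accuracy is simple enough to state as code. In this sketch ties count as errors, which is one reasonable convention rather than a fixed standard; the scores are toy values.

```python
from typing import List, Tuple


def pairwise_accuracy(pairs: List[Tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly.

    A tie (equal scores) is counted as incorrect.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Toy held-out scores (assumed): every chosen response outscores its rejected one.
held_out = [(1.42, -0.31), (2.05, -1.86), (0.94, -0.22), (1.18, -0.45)]
accuracy = pairwise_accuracy(held_out)
```

The metric ignores the size of each margin, which is why a model can score well here and still be poorly calibrated.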

The standard suite is RewardBench (Lambert et al., 2024), which breaks accuracy down by category — chat, safety, reasoning, code. Strong RMs hit 90%+ on chat and 75-85% on reasoning; the gap reflects how much harder it is to discriminate quality on multi-step problems. Calibration is a separate axis and almost no public benchmarks measure it well — which is why most off-the-shelf reward models are rank-correct but not calibrated, and why production stacks that need thresholds (rejection sampling, refuse-or-answer routing) re-train their own RMs.

Reward model evaluation

The standard benchmark is RewardBench (Lambert et al., 2024), which reports pairwise accuracy across chat, safety, reasoning, and code categories. Strong RMs hit 80-90%, weaker ones around 70%. Calibration is a separate axis, evaluated rarely and harder to get right than rank accuracy, which is why production stacks that need score thresholds (rejection sampling, refuse-or-answer routing) train their own RMs against Thurstone-recovered targets rather than reusing public ones.

Go further

What's the loss function for a reward model?

Bradley-Terry loss on pairwise preferences: given (winner, loser), maximize $\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$. The reward model outputs a scalar; the difference of scalars predicts which response is preferred. Same shape as Thurstone/Elo.

Why is reward model overoptimization a problem?

A reward model is an imperfect proxy for true human preference. Optimize against it hard enough and you find outputs the RM rates highly that humans actually wouldn't (Goodhart's law in action). The KL penalty in RLHF is meant to limit how far the policy can drift from the reference model; in practice, careful reward-model evaluation and conservative training are the real defenses.
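For reference, the KL-penalized objective mentioned above is usually written as follows. The symbols $\beta$, $\pi_\theta$, $\pi_{\mathrm{ref}}$, and the prompt distribution $\mathcal{D}$ are standard notation assumed here, not taken from the text.

```latex
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\; \beta \, \mathrm{KL}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```

A larger $\beta$ keeps the policy closer to the reference and limits how much it can exploit reward-model errors, at the cost of a smaller achievable reward gain.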

How do reranker training targets differ from reward modeling?

A reranker is a reward model for relevance — same shape, different domain. zELO's pipeline trains a pointwise reranker against Thurstone-recovered targets, structurally identical to how a calibrated reward model is trained from pairwise preferences. The retrieval and alignment communities reinvented the same machinery in parallel.
