Process Reward Model

Also known as: PRM, process supervision, step-level reward

TL;DR

A process reward model (PRM) scores each intermediate step of a reasoning chain, not just the final answer. It's the supervision signal that powers post-o1 reasoning models — credit assignment along the trajectory, not only at the end.

A process reward model (PRM) is a reward model that scores each step of a reasoning chain, not just the final answer. Where an outcome reward model (ORM) collapses an entire trajectory to a single scalar — was the answer right? — a PRM produces a per-step score, giving credit assignment along the chain.

Outcome vs process supervision

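For a chain $(x, s_1, \dots, s_T, y)$ (problem, intermediate steps, final answer):

Outcome reward model (ORM): $r_{\mathrm{ORM}}(x, s_1, \dots, s_T, y) \in \mathbb{R}$. One score per chain.
Process reward model (PRM): $r_{\mathrm{PRM}}(x, s_1, \dots, s_t) \in \mathbb{R}$ for each $t = 1, \dots, T$. One score per step.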

The post-o1 era shifted toward PRMs because they solve the got-the-right-answer-for-the-wrong-reason problem. A 10-step chain with one bad step that happens to land on the correct answer looks identical to an ORM as a clean 10-step chain. To a PRM, they’re entirely different — and the gradient signal during RL training reflects that.
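The contrast can be made concrete with a toy sketch. Everything here is hypothetical: `orm_score` and `prm_scores` are stand-ins for trained models, and the hard-coded validity flags replace what a real PRM would predict.

```python
from typing import List


def orm_score(final_answer: str, reference: str) -> float:
    """Toy ORM: one scalar per chain, based only on the final answer."""
    return 1.0 if final_answer == reference else 0.0


def prm_scores(step_is_valid: List[bool]) -> List[float]:
    """Toy PRM: one scalar per step (validity flags stand in for a model)."""
    return [1.0 if ok else 0.0 for ok in step_is_valid]


# Two 10-step chains that both land on the correct answer "42".
clean_steps = [True] * 10
lucky_steps = [True] * 6 + [False] + [True] * 3  # step 7 is wrong

orm_clean = orm_score("42", "42")
orm_lucky = orm_score("42", "42")
prm_clean = prm_scores(clean_steps)
prm_lucky = prm_scores(lucky_steps)

print(orm_clean == orm_lucky)           # the ORM cannot tell them apart
print(min(prm_clean), min(prm_lucky))   # the PRM flags the flawed chain
```

The ORM assigns both chains the same scalar, while the per-step scores expose the one bad step in the second chain.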

ORMs reward outcomes; PRMs reward trajectories. When the trajectory is the product — long-form reasoning, multi-step proofs, agentic action chains — outcome supervision under-specifies what the model should learn.

How PRMs are trained

Two label sources, in order of expense:

Step-level human annotation. Let’s Verify Step-by-Step (Lightman et al., 2023) — the foundational paper — collected the PRM800K dataset by having humans label each step in math chains as correct, incorrect, or neutral. Expensive but high-signal.
LLM-judge or automatic verifier labels. A strong judge model rates each step; for math and code, a deterministic verifier (a calculator, a unit-test runner) gives ground-truth labels for free. This is how production PRMs get to the millions-of-examples scale.
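As an illustration of the automatic-verifier route, here is a minimal sketch that labels arithmetic steps of the form `a op b = c`. The step format, the regular expression, and the `label_step` / `label_chain` helpers are assumptions made for this example; real pipelines use far richer verifiers (symbolic checkers, unit-test runners).

```python
import operator
import re
from typing import List, Optional

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
# Matches steps like "12 + 5 = 17" (integers only; an assumed toy format).
STEP_RE = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")


def label_step(step: str) -> Optional[int]:
    """Return 1 (correct), 0 (incorrect), or None (not checkable, i.e. neutral)."""
    m = STEP_RE.match(step)
    if m is None:
        return None
    a, op, b, claimed = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))
    return 1 if OPS[op](a, b) == claimed else 0


def label_chain(steps: List[str]) -> List[Optional[int]]:
    """Label every step of a chain with the deterministic verifier."""
    return [label_step(s) for s in steps]


chain = ["3 * 4 = 12", "12 + 5 = 18", "So the total is 18"]
print(label_chain(chain))
```

Steps the verifier cannot parse come back as `None`, which mirrors the "neutral" label in the human-annotation scheme.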

Training shape: take a transformer (often initialized from the same SFT base as the policy), add a per-token scalar head, and train with a per-step classification or regression loss. The architectural change is trivial; the data pipeline is everything.
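The per-step loss can be sketched in plain Python. This assumes a binary correct/incorrect label per step and that the scalar head is read at the last token of each step; both are common conventions but are assumptions here, not something the text above specifies. `prm_step_loss` and its arguments are hypothetical names.

```python
import math
from typing import List, Sequence


def bce_with_logits(z: float, y: int) -> float:
    """Numerically stable binary cross-entropy for a single logit z and label y."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def prm_step_loss(
    token_logits: Sequence[float],  # scalar head output, one value per token
    step_end_idx: Sequence[int],    # index of the last token of each step
    step_labels: Sequence[int],     # 1 = correct step, 0 = incorrect step
) -> float:
    """Sum the classification loss over steps, reading the head at step ends."""
    if len(step_end_idx) != len(step_labels):
        raise ValueError("need exactly one label per step")
    return sum(
        bce_with_logits(token_logits[i], y)
        for i, y in zip(step_end_idx, step_labels)
    )


# Six tokens, three steps ending at tokens 1, 3, and 5.
token_logits: List[float] = [0.0, 2.0, 0.0, -2.0, 0.0, 2.0]
step_ends = [1, 3, 5]
good = prm_step_loss(token_logits, step_ends, [1, 0, 1])  # head agrees with labels
bad = prm_step_loss(token_logits, step_ends, [0, 1, 0])   # head disagrees
print(good < bad)
```

Tokens that are not step boundaries contribute nothing to the loss, which is why the head can be per-token while the supervision stays per-step.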

How PRMs are used

The three production patterns

Best-of-N reranking with cumulative step scores. Sample N chains, score every step of every chain, aggregate (typically minimum, sometimes product) into a chain score, pick the top one. Min-aggregation is the workhorse: a chain is only as strong as its weakest step.
RL with per-step rewards. During policy optimization, use the PRM score for step t as the reward for the action that produced step t, rather than waiting for the chain to finish. Dense feedback shrinks the credit-assignment problem and can substantially speed up RL convergence on reasoning tasks.
Search over reasoning trees. A PRM is a value function over partial chains, which makes tree search over reasoning steps tractable — prune branches with low PRM scores; expand branches with high ones.
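The reranking pattern is small enough to sketch directly. The step scores below are made-up numbers standing in for PRM outputs in [0, 1], and `best_of_n` is a hypothetical helper, not a library function.

```python
import math
from typing import Callable, Dict, List, Sequence

# Ways to collapse per-step scores into one chain score.
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "min": min,
    "prod": math.prod,
    "sum": sum,
}


def best_of_n(chains_step_scores: List[List[float]], how: str = "min") -> int:
    """Return the index of the chain with the highest aggregated PRM score."""
    aggregate = AGGREGATORS[how]
    chain_scores = [aggregate(scores) for scores in chains_step_scores]
    return max(range(len(chain_scores)), key=chain_scores.__getitem__)


candidates = [
    [0.9, 0.9, 0.2, 0.9],  # strong overall, but one weak step
    [0.7, 0.7, 0.7, 0.7],  # uniformly decent
    [0.95, 0.6],           # short chain
]

for how in ("min", "prod", "sum"):
    print(how, best_of_n(candidates, how))
```

On this toy input the three aggregators pick three different chains: the sum rewards length, the product penalizes it, and the minimum ignores length and looks only at the weakest step.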

Why this enables test-time scaling

The key property: a PRM gives you a value function over partial chains. With an ORM you can only score completed chains, so test-time scaling means sampling more completed chains and picking among them (much as self-consistency majority voting does). With a PRM you can score chains while they’re being generated, concentrating compute on promising branches and aborting dead ends.

That’s the lever behind o1-style “think longer for harder problems”: the PRM tells the policy when to keep exploring and when to commit, and that’s what scales with extra inference budget.
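One simple way to use a PRM as a value function over partial chains is beam search over steps. The sketch below is an assumption-laden toy: `propose` stands in for sampling next steps from the policy, `score` stands in for the PRM, and the three-letter step vocabulary is invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Chain = Tuple[str, ...]


def prm_beam_search(
    propose: Callable[[Chain], List[str]],  # candidate next steps for a partial chain
    score: Callable[[Chain], float],        # PRM score of a partial chain
    beam_width: int,
    depth: int,
) -> List[Chain]:
    """Expand partial chains step by step, keeping only the top-scoring ones."""
    beams: List[Chain] = [()]
    for _ in range(depth):
        candidates = [chain + (step,) for chain in beams for step in propose(chain)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]  # prune low-scoring branches mid-chain
    return beams


# Toy stand-ins: each step has a fixed quality; a chain is as good as its worst step.
STEP_QUALITY: Dict[str, float] = {"a": 0.9, "b": 0.5, "c": 0.1}


def toy_propose(chain: Chain) -> List[str]:
    return list(STEP_QUALITY)


def toy_prm(chain: Chain) -> float:
    return min(STEP_QUALITY[s] for s in chain)


beams = prm_beam_search(toy_propose, toy_prm, beam_width=2, depth=4)
print(beams[0])
```

With a beam width of 2 and depth 4, this scores 3 + 6 + 6 + 6 = 21 partial chains instead of completing all 3^4 = 81 chains, which is the compute concentration described above in miniature.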

Where PRMs sit in the modern recipe

Process-style supervision shows up across the major labs' reasoning work, though less uniformly than is often assumed. OpenAI’s o1 family is widely understood to use process supervision, though the details are not public. Anthropic has discussed process-style training for extended-thinking models in research releases. DeepSeek-R1 is the notable exception: its technical report describes rule-based outcome rewards (answer accuracy and output format, with code checked against test cases) and lists PRMs among the approaches that did not pay off at scale, citing the difficulty of defining a step, the cost of labeling steps, and reward hacking.

The unifying picture: pretraining gives you a base model, SFT gives you instruction-following, and process-supervised RL on verifier-checkable problems gives you reasoning. The PRM is the second-most-important piece of that last step — second only to the verifier itself.

Lightman et al. (2023) trained both an ORM and a PRM on MATH problems and ran a head-to-head comparison on best-of-N reranking. The PRM outperformed the ORM at every N tested, reaching roughly 78% versus 72% of problems solved at best-of-1860 on a MATH test subset, and the gap widened with more samples. That second observation is the important one: with more inference compute, the PRM gets more leverage from the dense step-level signal, while the ORM gains less because it’s only scoring outcomes.

This was the empirical proof that process supervision wasn’t just a cleaner research story — it was strictly better at scale. The dataset they released, PRM800K, became the de-facto starting point for academic PRM work; the paper’s framing (“verify each step”) set the conceptual vocabulary for the post-o1 era.

For tasks where the chain is short and the outcome is binary, ORMs and PRMs converge — there’s only one or two steps to score, and outcome correlates almost perfectly with step quality. Classification and one-shot QA fall here.

For tasks where the chain is long and the outcome is expensive to evaluate (essay writing, code review, design critique), PRMs dominate but are also harder to train — there’s no automatic verifier, so step labels come from LLM judges, which carries its own error profile.

For tasks where the chain is long and outcomes are cheap to verify (math, competitive programming, formal proofs), PRMs absolutely dominate ORMs and the gap grows with the number of samples N. This is where the bulk of post-o1 reasoning model training has concentrated, because verification-cheap-generation-hard is exactly the regime where you can manufacture vast amounts of step-level signal automatically.

Go further

How are PRMs labeled and trained?

Step-level labels — each intermediate step in a chain is annotated as correct, incorrect, or neutral. Lightman et al. (2023) used human annotators on math chains; modern recipes use LLM judges or automatic verifiers (executing code, checking math) to scale labeling. Training is then a per-step classification or regression head on top of a transformer trunk; the loss is summed across steps. The expensive part is the labels, not the architecture.

Related: Reward modeling, Chain-of-thought

How is a PRM used at inference?

Two regimes. Best-of-N reranking: sample N chains, score each step with the PRM, aggregate (sum, min, or product) the per-step scores, and pick the highest-scoring chain. Min-aggregation is most common, since a chain is only as strong as its weakest step. RL training: use the per-step PRM scores as a per-step reward in a policy-gradient loop, giving the model dense feedback rather than the sparse final-answer-only signal.

Related: Self-consistency, RLHF

Why does this enable test-time scaling?

An ORM only tells you which final answer to pick from N candidates: sample more and you can pick better. A PRM tells you which intermediate state is more promising, which means you can prune dead ends mid-chain (saving compute) and focus sampling on branches that look good (concentrating compute). Tree search over reasoning becomes tractable because you have a per-node value function. That's the unlock behind o1-style 'think longer for harder problems'.

Related: Reasoning model, Tree-of-thought
