Entropy Regularization

Also known as: entropy bonus, max-entropy regularization, entropy term

TL;DR

Entropy regularization adds an entropy bonus to a training objective so that the model's output distribution does not collapse onto a few outputs too early. It is used in policy-gradient RL (PPO, SAC, A3C) to encourage exploration.

Entropy regularization adds an explicit reward (or negative loss) for the model’s output distribution being spread out:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} - \beta \, H\big(p_\theta(\cdot \mid x)\big)$$

where H(p_θ(· | x)) is the entropy of the model’s predictive distribution and β controls how strongly the model is pulled away from collapsing onto a single output. The technique is canonical in policy-gradient RL but shows up in other training regimes whenever “do not over-commit” is the right inductive bias.
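
To make the effect of the entropy term concrete, here is a minimal sketch using only the Python standard library. The function names (`entropy`, `regularized_loss`) and the example distributions are illustrative assumptions, not part of any particular library.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def regularized_loss(task_loss, probs, beta):
    """Task loss minus beta times the entropy of the predictive distribution."""
    return task_loss - beta * entropy(probs)

# Two hypothetical predictive distributions over four outputs.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

# With the same task loss, the spread-out distribution gets the lower total loss.
loss_peaked = regularized_loss(1.0, peaked, beta=0.01)
loss_uniform = regularized_loss(1.0, uniform, beta=0.01)
```

The uniform distribution has the maximum possible entropy for four outcomes (log 4 nats), so it receives the largest bonus; the peaked one receives almost none.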

Three places it shows up in practice

1. Policy-gradient RL

In on-policy methods (REINFORCE, A3C, PPO), the policy is a stochastic distribution over actions. The gradient of the expected return only flows through actions the policy takes. If the policy is already nearly deterministic, the agent never tries alternatives — the exploration problem.

Entropy reg keeps the policy stochastic during training:

loss = -mean(advantage * log_prob(actions))  # standard policy gradient
loss -= beta * entropy(policy_dist)          # entropy bonus

A common PPO setting is entropy_coef = 0.01, the value the original paper used for its Atari experiments. Soft actor-critic (SAC) goes further: it adds an α · H(π) term to the reward itself, so the policy is trained to maximize return and entropy jointly.
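
The two-line pseudocode above can be expanded into a runnable sketch. This version only computes the scalar loss for a batch of categorical policies; in a real training loop an autodiff framework would differentiate it. The function name `pg_loss_with_entropy` and its argument layout are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def pg_loss_with_entropy(logits_batch, actions, advantages, beta=0.01):
    """Mean policy-gradient loss minus beta times the mean policy entropy."""
    pg_terms, ent_terms = [], []
    for logits, action, adv in zip(logits_batch, actions, advantages):
        probs = softmax(logits)
        pg_terms.append(-adv * math.log(probs[action]))  # standard policy gradient
        ent_terms.append(entropy(probs))                 # per-state entropy
    n = len(pg_terms)
    return sum(pg_terms) / n - beta * sum(ent_terms) / n  # entropy bonus
```

For a single state with uniform logits over three actions and advantage 1, the policy-gradient term is log 3, and the bonus subtracts β · log 3 from it.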

2. LM RLHF and DPO

In language-model RL fine-tuning (RLHF, DPO), the dominant regularization is not entropy but KL-to-reference. That said, an explicit entropy bonus sometimes appears alongside the KL term: the KL keeps the policy near the SFT reference, and the entropy bonus prevents pathological collapse within that constraint. The two are complementary: KL anchors the location, entropy preserves the spread.
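
As a rough illustration of how the two terms combine, the sketch below shapes a scalar reward with a KL penalty and an entropy bonus over full next-token distributions. The names `regularized_reward`, `kl_coef`, and `beta` are assumptions; practical RLHF code usually estimates the KL term from the log-probabilities of sampled tokens rather than from full distributions.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def regularized_reward(reward, policy, reference, kl_coef, beta):
    """Reward minus a KL-to-reference penalty plus an entropy bonus."""
    return reward - kl_coef * kl_divergence(policy, reference) + beta * entropy(policy)

# Hypothetical next-token distributions over a two-token vocabulary.
reference = [0.9, 0.1]
policy = [0.5, 0.5]
```

The KL term is zero only when the policy matches the reference, whereas the entropy term ignores the reference entirely and only measures spread.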

3. Semi-supervised and pseudo-labeled training

In pseudo-label-based methods (FixMatch, Mean Teacher, Noisy Student), the model trains on its own confident predictions. The pathology is that a model that gets a few examples slightly wrong can amplify that error into confident misclassification. A small entropy bonus on unlabeled examples counteracts this collapse.
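
A heavily simplified sketch of the idea for a single unlabeled example follows. It is not the FixMatch, Mean Teacher, or Noisy Student procedure (there is no augmentation or teacher model here); the function name `unlabeled_loss`, the confidence threshold, and the β value are all assumptions chosen for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def unlabeled_loss(probs, threshold=0.95, beta=0.1):
    """Loss for one unlabeled example, given the model's predicted probabilities.

    Pseudo-label term: cross-entropy against the argmax class, applied only
    when the model's confidence clears the threshold.
    Entropy bonus: subtracts beta * H(probs), rewarding spread-out predictions.
    """
    confidence = max(probs)
    pseudo_label_term = -math.log(confidence) if confidence >= threshold else 0.0
    return pseudo_label_term - beta * entropy(probs)
```

The pseudo-label term pulls the prediction toward one-hot, while the entropy term pulls the other way; β sets the balance between the two.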

The relationship to temperature

Adjacent but distinct.

Temperature sampling is a decode-time intervention: train a model, then at inference, scale the logits by 1/T before softmax. Higher T = higher output entropy at decode. The weights are unchanged.
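
The decode-time scaling can be sketched in a few lines. The function name `softmax_with_temperature` and the example logits are assumptions; the logits stand in for whatever a trained model would emit.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits from a fixed, already-trained model.
logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
hot = softmax_with_temperature(logits, temperature=2.0)
```

The same logits yield a flatter distribution as T grows, and nothing about the model itself changes.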

Entropy regularization is a training-time intervention: add an entropy term to the loss so the model learns to prefer spread-out predictions over peaked ones. The weights themselves carry the regularization.

A model trained with no entropy reg and decoded with high temperature can mimic the output of a model trained with entropy reg and decoded at T=1 — but only in the immediate one-step sense. Over many decode steps, the trained-in regularization compounds (the model has learned not to be over-confident) in a way temperature scaling cannot reproduce.

In RL, this distinction matters: temperature scaling at decode does not change the policy’s value estimates or its training dynamics. Entropy regularization at training does. SAC’s success rests on this difference.

How to tune β

The β coefficient controls how much of the gradient comes from “do the task” vs “stay uncertain.” Three useful defaults:

PPO on Atari: β ≈ 0.01 (the original paper’s Atari setting; its MuJoCo experiments used no entropy bonus). Much larger β tends to destabilize training; much smaller β risks early collapse.
SAC: α (their notation) is learned, with a target-entropy hyperparameter set to a function of the action-space dimensionality (e.g., −dim(A)). The learned α adapts to the training stage.
LM RL fine-tuning: usually 0 or very small (e.g., 1e-4), with KL-to-reference doing the heavy lifting. Larger entropy coefs interfere with the SFT reference structure.
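
The learned-α idea from the SAC entry can be sketched as a single gradient step. This assumes one common variant that optimizes log α (which keeps α positive) with loss −log α · mean(log π(a|s) + target entropy); the function name `update_log_alpha` and the learning rate are illustrative assumptions.

```python
import math

def update_log_alpha(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on
    loss(log_alpha) = -log_alpha * mean(log_prob + target_entropy)."""
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    return log_alpha - lr * grad

action_dim = 2                        # hypothetical action-space dimensionality
target_entropy = -float(action_dim)   # the -dim(A) heuristic

# Log-probabilities of sampled actions (densities, so they can be positive).
low_entropy_batch = [4.0, 4.0]    # entropy estimate -4, below the target of -2
high_entropy_batch = [0.0, 0.0]   # entropy estimate 0, above the target of -2

raised = update_log_alpha(0.0, low_entropy_batch, target_entropy)
lowered = update_log_alpha(0.0, high_entropy_batch, target_entropy)
alpha_after_raise = math.exp(raised)
```

When the policy's entropy estimate falls below the target, α goes up and the entropy term gets more weight; when it sits above the target, α goes down.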

The honest move is to start at the literature default for your setting, then sweep β on a log scale (10x steps) and pick by validation reward / NLL.
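
A log-scale sweep around a default can be sketched as below. `evaluate` is a hypothetical callable standing in for a full training-plus-validation run that returns a score where higher is better; it is not a real API.

```python
def log_scale_grid(center, decades_each_side=2):
    """Candidate values spaced 10x apart around a literature default."""
    return [center * 10.0 ** k for k in range(-decades_each_side, decades_each_side + 1)]

def pick_best(candidates, evaluate):
    """Return the candidate with the highest validation score."""
    return max(candidates, key=evaluate)

# Five candidates from 1e-4 to 1.0, centered on a default of 0.01.
grid = log_scale_grid(0.01, decades_each_side=2)
```

If the best value lands on the edge of the grid, extending the sweep by another decade in that direction is usually more informative than refining between neighbors.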

Failure modes

β too large: the policy never commits. Returns plateau because the agent keeps exploring instead of exploiting.
β too small: collapse to argmax, exploration dies, agent gets stuck.
β constant when it should decay: in long training runs, the right entropy budget changes over time. Linear or exponential decay on β often outperforms constant β.
β on the wrong term: in some formulations, the entropy is taken over the next-token distribution; in others, over the trajectory. Match the formulation to the algorithm.
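
The decay schedules mentioned above can be sketched as plain functions of the training step. The function names, the starting value of 0.01, and the decay rate are assumptions for illustration, not recommended settings.

```python
def linear_decay(step, total_steps, beta_start=0.01, beta_end=0.0):
    """Linearly interpolate beta from beta_start to beta_end over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)

def exponential_decay(step, beta_start=0.01, decay_rate=0.999):
    """Multiply beta by decay_rate once per step."""
    return beta_start * decay_rate ** step
```

Linear decay reaches its floor at a known step, while exponential decay never reaches zero, so it keeps a small amount of exploration pressure throughout the run.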

The honest take

Entropy regularization is a one-hyperparameter cure for a specific pathology: a model that gets too confident too fast. It pays back enormously when that is your problem (RL exploration, pseudo-labeling) and is mostly a no-op when it is not (standard supervised classification, where the cross-entropy loss already provides the right signal). The technique is small but load-bearing — most published RL results have it, often invisibly inside the algorithm’s default config.

Go further

Why does entropy regularization help exploration in RL?

A policy that collapses to argmax stops exploring. Adding β·H(π(·|s)) to the reward (or subtracting from the loss) creates a gradient pressure toward higher-entropy policies, which by construction try more diverse actions. The agent gets penalized for being too sure of itself before the value estimates have stabilized. Once the value function is well-fit, β can decay so the policy can commit to good actions.
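
Written out, the reward-side formulation looks like the following. This is the standard maximum-entropy RL objective expressed with this page's β; the discount factor γ and the reward symbol r are notation introduced here for illustration.

```latex
J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t} \gamma^{t} \Big( r(s_t, a_t) + \beta \, H\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
H\big(\pi(\cdot \mid s)\big) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s)
```

Setting β = 0 recovers the ordinary expected-return objective, which is why decaying β over training smoothly hands control back to the task reward.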

How is it different from KL regularization to a reference policy?

Entropy regularization keeps the policy spread out intrinsically: it does not care what the spread-out distribution looks like, just that it is not too peaked. KL regularization to a reference (the dominant approach in RLHF/PPO and DPO) anchors the trained policy near a specific reference distribution. Entropy reg encourages exploration; KL reg encourages conservatism. PPO combines both ideas: it limits how far each update moves from the previous policy (via clipping or a KL penalty) and adds an entropy bonus.

Does entropy regularization help in supervised fine-tuning?

Sometimes. In standard cross-entropy SFT it is rarely useful: the loss already pushes toward the one-hot label, and entropy regularization fights that signal. In semi-supervised settings with pseudo-labels, adding a small entropy bonus to unlabeled examples prevents the model from collapsing to a degenerate confident-and-wrong solution. Closely related ideas appear under other names, such as the confidence penalty and label smoothing.
