The signal your reward model sends during RLHF training may be louder than it should be. A paper published on arXiv by Meta researchers in June 2026 identifies a structural flaw in how most reward models operate: they assign meaningfully different scores to responses that are, by any reasonable measure, equally good.
The paper, titled “Discretizing Reward Models,” frames this as oversensitivity. Reward models produce continuous scores, which sounds like a feature. Finer gradations mean the model can detect subtle quality differences. The Meta team argues the opposite is often true: those gradations frequently reflect noise rather than signal, and the optimization process treats noise as instruction.
The practical consequence is reward hacking. When a policy model is trained against an oversensitive reward model, it learns to exploit the scoring function rather than improve the quality of its outputs. The behavior looks like improvement on the reward metric. It is not improvement on anything the user would recognize.
The researchers propose replacing the single concept of reward model accuracy with two separate measurements: discriminative ability (can the model correctly order clearly different responses?) and specificity (does it refrain from penalizing responses that are genuinely equivalent?). Most current evaluations collapse these into one number, which masks the failure mode entirely.
The solution the team describes requires no retraining. Using Monte Carlo dropout, the method samples the reward model multiple times to produce a distribution of scores for each response, then clusters those distributions into discrete reward bands. Responses that fall within the same cluster receive the same reward signal. The scoring becomes coarser by design.
That coarseness is the point. A reward model that sorts responses into five meaningful quality tiers, rather than assigning 83.4 versus 84.1 to two functionally identical answers, gives the policy model less surface area to exploit. The paper reports that discretized rewards produce less reward hacking and better policies than training on the original continuous scores, validated in both controlled and natural reinforcement learning settings.
Reward model quality is one of the least discussed levers in RLHF pipelines. Most of the public attention on alignment training focuses on human preference data, dataset scale, and the choice of base model. The reward model itself is often treated as a solved problem once it achieves high accuracy on held-out comparisons. This paper challenges that assumption directly: high accuracy on a pairwise ranking task is compatible with severe oversensitivity on responses that should be treated as ties.
The training-free nature of the fix matters. Teams running post-training on capable base models can apply Monte Carlo dropout discretization to an existing reward model without rebuilding the training pipeline. That is a lower bar than retraining a reward model from scratch or switching to verifiable rewards, which require ground-truth answers and work only in domains where correct outputs can be checked programmatically.
The broader implication is methodological. If labs are evaluating their reward models only on discriminative accuracy, they may be shipping RLHF runs that optimize against a noisy signal without knowing it. The paper’s two-metric framework gives practitioners a concrete diagnostic to run before committing to a training run.
Teams running RLHF or RLAIF pipelines should benchmark their reward models separately on discriminative ability and specificity before the next fine-tuning cycle; a high composite accuracy score is not sufficient evidence that the reward signal is safe to optimize against.
Published on arXiv by Meta researchers in June 2026 (arXiv
.21795).