Bibliographic Details
Main Authors: Sun, Wangtao, Cheng, Xiang, Yu, Xing, Xu, Haotian, Yang, Zhao, He, Shizhu, Zhao, Jun, Liu, Kang
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2503.22480
_version_ 1866913840770318336
author Sun, Wangtao
Cheng, Xiang
Yu, Xing
Xu, Haotian
Yang, Zhao
He, Shizhu
Zhao, Jun
Liu, Kang
contents Reinforcement learning from human feedback (RLHF) is a critical technique for training large language models. However, conventional reward models based on the Bradley-Terry model (BTRM) often suffer from overconfidence when faced with inconsistent labels or out-of-distribution samples, leading to reward hacking, where the policy model blindly optimizes for proxy rewards while degrading true performance. This paper proposes the Probabilistic Uncertain Reward Model (PURM), which generalizes the Bradley-Terry model to learn the reward distributions that emerge from the preference data. We theoretically derive the loss function of PURM and introduce a novel method that uses the overlap between distributions to quantify uncertainty. Empirical results show that PURM outperforms existing methods with more accurate reward and sound uncertainty estimates, sustains effective learning for more optimization steps, and obtains a higher maximum win rate in RLHF. The data and code of this paper are released at https://anonymous.4open.science/r/Probabilistic-Uncertain-Reward-Model/
format Preprint
id arxiv_https___arxiv_org_abs_2503_22480
institution arXiv
publishDate 2025
record_format arxiv
title Probabilistic Uncertain Reward Model
topic Machine Learning
url https://arxiv.org/abs/2503.22480