| Main Authors: | Sun, Wangtao; Cheng, Xiang; Yu, Xing; Xu, Haotian; Yang, Zhao; He, Shizhu; Zhao, Jun; Liu, Kang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2503.22480 |
| _version_ | 1866913840770318336 |
|---|---|
| author | Sun, Wangtao; Cheng, Xiang; Yu, Xing; Xu, Haotian; Yang, Zhao; He, Shizhu; Zhao, Jun; Liu, Kang |
| author_facet | Sun, Wangtao; Cheng, Xiang; Yu, Xing; Xu, Haotian; Yang, Zhao; He, Shizhu; Zhao, Jun; Liu, Kang |
| contents | Reinforcement learning from human feedback (RLHF) is a critical technique for training large language models. However, conventional reward models based on the Bradley-Terry model (BTRM) often suffer from overconfidence when faced with inconsistent labels or out-of-distribution samples, leading to reward hacking, where the policy model blindly optimizes for proxy rewards while degrading true performance. This paper proposes the Probabilistic Uncertain Reward Model (PURM), which generalizes the Bradley-Terry model to learn the reward distributions that emerge from the preference data. We theoretically derive the loss function of PURM and introduce a novel method that uses the overlap between distributions to quantify uncertainty. Empirical results show that PURM outperforms existing methods, giving more accurate reward estimates and sound uncertainty estimates, and that in RLHF it sustains effective learning for more optimization steps and obtains a higher maximum win rate. The data and code of this paper are released at https://anonymous.4open.science/r/Probabilistic-Uncertain-Reward-Model/ |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2503_22480 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Probabilistic Uncertain Reward Model |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2503.22480 |
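
The abstract in this record mentions two ideas that a small example may help make concrete: the Bradley-Terry preference loss used by conventional reward models, and the use of overlap between reward distributions as an uncertainty signal. The sketch below is illustrative only. It assumes each reward is a one-dimensional Gaussian and measures overlap with the Bhattacharyya coefficient, which is one common choice; the record does not state the paper's exact loss or overlap formula, and the function names `bradley_terry_loss` and `gaussian_overlap` are hypothetical.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one
    under the standard Bradley-Terry model: -log(sigmoid(r_chosen - r_rejected))."""
    d = r_chosen - r_rejected
    # Numerically stable softplus(-d), avoiding overflow in exp for large |d|.
    if d > 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


def gaussian_overlap(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """Bhattacharyya coefficient between two 1-D Gaussians, a value in (0, 1].

    Assumption: this is one possible overlap measure, not necessarily the one
    used in the paper. Standard deviations must be strictly positive.
    """
    var_sum = sigma1 ** 2 + sigma2 ** 2
    distance = (mu1 - mu2) ** 2 / (4.0 * var_sum) + 0.5 * math.log(
        var_sum / (2.0 * sigma1 * sigma2)
    )
    return math.exp(-distance)


if __name__ == "__main__":
    # A larger reward margin gives a smaller preference loss.
    print(bradley_terry_loss(2.0, 0.5), bradley_terry_loss(0.5, 2.0))
    # Well-separated reward distributions overlap less than close ones.
    print(gaussian_overlap(0.0, 1.0, 1.0, 1.0), gaussian_overlap(0.0, 1.0, 5.0, 1.0))
```

Under these assumptions, two identical reward distributions have overlap 1.0, and the overlap shrinks as the means move apart, so a high overlap between the chosen and rejected rewards can be read as a sign that the preference is uncertain.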