Staff View: Probabilistic Uncertain Reward Model :: Library Catalog

Bibliographic Details
Main Authors:	Sun, Wangtao; Cheng, Xiang; Yu, Xing; Xu, Haotian; Yang, Zhao; He, Shizhu; Zhao, Jun; Liu, Kang
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2503.22480

_version_	1866913840770318336
author	Sun, Wangtao; Cheng, Xiang; Yu, Xing; Xu, Haotian; Yang, Zhao; He, Shizhu; Zhao, Jun; Liu, Kang
author_facet	Sun, Wangtao; Cheng, Xiang; Yu, Xing; Xu, Haotian; Yang, Zhao; He, Shizhu; Zhao, Jun; Liu, Kang
contents	Reinforcement learning from human feedback (RLHF) is a critical technique for training large language models. However, conventional reward models based on the Bradley-Terry model (BTRM) often suffer from overconfidence when faced with inconsistent labels or out-of-distribution samples, leading to reward hacking, where the policy model blindly optimizes for proxy rewards while degrading true performance. This paper proposes the Probabilistic Uncertain Reward Model (PURM), which generalizes the Bradley-Terry model to learn the reward distributions that emerge from the preference data. We theoretically derive the loss function of PURM and introduce a novel method that uses the overlap between distributions to quantify uncertainty. Empirical results show that PURM outperforms existing methods with more accurate reward and sound uncertainty estimates, sustains effective learning for more optimization steps, and obtains a higher maximum win rate in RLHF. The data and code of this paper are released at https://anonymous.4open.science/r/Probabilistic-Uncertain-Reward-Model/
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_22480
institution	arXiv
publishDate	2025
record_format	arxiv
title	Probabilistic Uncertain Reward Model
topic	Machine Learning
url	https://arxiv.org/abs/2503.22480
