Staff View: Establishing Reliability Metrics for Reward Models in Large Language Models :: Library Catalog


Bibliographic Details
Main Authors:	Chen, Yizhou; Liu, Yawen; Wang, Xuesi; Yu, Qingtao; Huzhang, Guangda; Zeng, Anxiang; Yu, Han; Zhou, Zhiming
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.14838

_version_	1866916699279720448
author	Chen, Yizhou; Liu, Yawen; Wang, Xuesi; Yu, Qingtao; Huzhang, Guangda; Zeng, Anxiang; Yu, Han; Zhou, Zhiming
author_facet	Chen, Yizhou; Liu, Yawen; Wang, Xuesi; Yu, Qingtao; Huzhang, Guangda; Zeng, Anxiang; Yu, Han; Zhou, Zhiming
contents	The reward model (RM), which represents human preferences, plays a crucial role in optimizing the outputs of large language models (LLMs), e.g., through reinforcement learning from human feedback (RLHF) or rejection sampling. However, a long-standing challenge for RMs is their uncertain reliability, i.e., LLM outputs with higher rewards may not align with actual human preferences. Currently, there is no convincing metric to quantify the reliability of RMs. To bridge this gap, we propose the Reliable at $η$ (RETA) metric, which directly measures the reliability of an RM by evaluating the average quality (scored by an oracle) of the top $η$ quantile of responses as ranked by the RM. On top of RETA, we present an integrated benchmarking pipeline that allows anyone to evaluate their own RM without incurring additional oracle labeling costs. Extensive experimental studies demonstrate the superior stability of the RETA metric, providing solid evaluations of the reliability of various publicly available and proprietary RMs. When dealing with an unreliable RM, we can use the RETA metric to identify the optimal quantile from which to select responses.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_14838
institution	arXiv
publishDate	2025
record_format	arxiv
title	Establishing Reliability Metrics for Reward Models in Large Language Models
topic	Artificial Intelligence
url	https://arxiv.org/abs/2504.14838
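
Note on the abstract: the RETA idea described in the contents field (average oracle-scored quality of the top η quantile of responses as ranked by an RM) can be illustrated with a minimal sketch. This is an assumption-laden reading of the abstract only, not the authors' implementation: the function names `reta` and `best_eta`, the rounding rule for the cutoff, and the interpretation of "top η quantile" as the top η fraction by RM score are all hypothetical.

```python
def reta(rm_scores, oracle_scores, eta):
    """Mean oracle score of the responses in the top `eta` fraction by RM score.

    Assumption: "top eta quantile" means the round(eta * n) highest-scored
    responses, with at least one response always kept.
    """
    if not rm_scores or len(rm_scores) != len(oracle_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    if not 0 < eta <= 1:
        raise ValueError("eta must be in (0, 1]")
    n = len(rm_scores)
    k = max(1, round(eta * n))
    # Rank response indices by RM score, highest first (ties keep input order).
    ranked = sorted(range(n), key=lambda i: rm_scores[i], reverse=True)
    top = ranked[:k]
    return sum(oracle_scores[i] for i in top) / k


def best_eta(rm_scores, oracle_scores, etas):
    """Hypothetical helper: the candidate eta with the highest RETA value."""
    return max(etas, key=lambda e: reta(rm_scores, oracle_scores, e))


# Toy data (invented for illustration): the RM prefers the responses
# that the oracle rates lowest, i.e. an unreliable RM.
rm = [0.9, 0.8, 0.1, 0.2]
oracle = [1.0, 3.0, 5.0, 7.0]
print(reta(rm, oracle, 0.5))
print(best_eta(rm, oracle, [0.25, 0.5, 1.0]))
```

With this toy data the RM ranks the oracle's least-preferred responses first, so widening the selection raises the average oracle score; that is the kind of situation in which the abstract suggests using RETA to choose a quantile.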

Similar Items