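Staff View: On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization :: Library Catalog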

Bibliographic Details
Main Authors:	Lin, Yong; Seto, Skyler; ter Hoeve, Maartje; Metcalf, Katherine; Theobald, Barry-John; Wang, Xuan; Zhang, Yizhe; Huang, Chen; Zhang, Tong
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2409.03650

_version_	1866929525375369216
author	Lin, Yong; Seto, Skyler; ter Hoeve, Maartje; Metcalf, Katherine; Theobald, Barry-John; Wang, Xuan; Zhang, Yizhe; Huang, Chen; Zhang, Tong
author_facet	Lin, Yong; Seto, Skyler; ter Hoeve, Maartje; Metcalf, Katherine; Theobald, Barry-John; Wang, Xuan; Zhang, Yizhe; Huang, Chen; Zhang, Tong
contents	Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO (denoted as DPORM) can approximate an EXRM in the limit. DPORM's effectiveness directly implies the optimality of the learned policy, and also has practical implications for LLM alignment methods, including iterative DPO. However, it is unclear how well DPORM empirically matches the performance of EXRM. This work studies the accuracy of both DPORM and EXRM at distinguishing preferred from rejected answers. Our findings indicate that even though DPORM fits the training dataset comparably, it generalizes less effectively than EXRM, especially when the validation datasets contain distribution shifts. Across five out-of-distribution settings, DPORM has a mean drop in accuracy of 3% and a maximum drop of 7%. These findings highlight that DPORM has limited generalization ability and substantiate the integration of an explicit reward model in iterative DPO approaches.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_03650
institution	arXiv
publishDate	2024
record_format	arxiv
title	On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2409.03650
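
Note on the abstract: the record does not give the form of the implicit reward or the evaluation code. As a minimal sketch, assuming the standard DPO formulation in which the implicit reward is beta * (log pi(y|x) - log pi_ref(y|x)) up to a prompt-dependent term that cancels when two answers to the same prompt are compared, the accuracy at distinguishing preferred from rejected answers could be computed as below. The names PreferencePair, implicit_reward, and preference_accuracy, and the default beta = 0.1, are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """Sequence log-probabilities for one (prompt, chosen, rejected) triple.

    Hypothetical container: each value is log pi(y | x) summed over tokens,
    under either the DPO-trained policy or the frozen reference model.
    """
    policy_logp_chosen: float
    policy_logp_rejected: float
    ref_logp_chosen: float
    ref_logp_rejected: float


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    # Assumed standard DPO implicit reward: beta * log(pi / pi_ref).
    # The prompt-dependent normalizer is omitted because it cancels
    # when two answers to the same prompt are compared.
    return beta * (policy_logp - ref_logp)


def preference_accuracy(pairs: List[PreferencePair], beta: float = 0.1) -> float:
    # Fraction of pairs where the chosen answer gets the strictly higher reward.
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = 0
    for p in pairs:
        r_chosen = implicit_reward(p.policy_logp_chosen, p.ref_logp_chosen, beta)
        r_rejected = implicit_reward(p.policy_logp_rejected, p.ref_logp_rejected, beta)
        if r_chosen > r_rejected:
            correct += 1
    return correct / len(pairs)
```

An explicit reward model would be scored with the same accuracy loop, with the two implicit_reward calls replaced by the model's scalar outputs for the chosen and rejected answers.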

Similar Items