MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Shi, Ruizhe; Song, Minhak; Zhou, Runlong; Zhang, Zihan; Fazel, Maryam; Du, Simon S.
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2505.19770

_version_	1866913114525532160
author	Shi, Ruizhe; Song, Minhak; Zhou, Runlong; Zhang, Zihan; Fazel, Maryam; Du, Simon S.
author_facet	Shi, Ruizhe; Song, Minhak; Zhou, Runlong; Zhang, Zihan; Fazel, Maryam; Du, Simon S.
contents	We present a fine-grained theoretical analysis of the performance gap between two-stage reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). Our study decomposes this gap into two sources: the explicit representation gap under exact optimization and the implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the quality of the final policy. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_19770
institution	arXiv
publishDate	2025
record_format	arxiv
title	Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2505.19770

Similar Items