Bibliographic Details
Main authors: Shi, Ruizhe; Song, Minhak; Zhou, Runlong; Zhang, Zihan; Fazel, Maryam; Du, Simon S.
Type: Preprint
Published: 2025
Subjects: Machine Learning; Computation and Language
Online access: https://arxiv.org/abs/2505.19770
_version_ 1866913114525532160
author Shi, Ruizhe
Song, Minhak
Zhou, Runlong
Zhang, Zihan
Fazel, Maryam
Du, Simon S.
author_facet Shi, Ruizhe
Song, Minhak
Zhou, Runlong
Zhang, Zihan
Fazel, Maryam
Du, Simon S.
contents We present a fine-grained theoretical analysis of the performance gap between two-stage reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). Our study decomposes this gap into two sources: the explicit representation gap under exact optimization and the implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the quality of the final policy. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
format Preprint
id arxiv_https___arxiv_org_abs_2505_19770
institution arXiv
publishDate 2025
record_format arxiv
title Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2505.19770