Staff View: Why DPO is a Misspecified Estimator and How to Fix It :: Library Catalog

Bibliographic Details
Main Authors:	Gopalan, Aditya; Chowdhury, Sayak Ray; Banerjee, Debangshu
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2510.20413

_version_	1866908607053824000
author	Gopalan, Aditya; Chowdhury, Sayak Ray; Banerjee, Debangshu
author_facet	Gopalan, Aditya; Chowdhury, Sayak Ray; Banerjee, Debangshu
contents	Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_20413
institution	arXiv
publishDate	2025
record_format	arxiv
title	Why DPO is a Misspecified Estimator and How to Fix It
topic	Machine Learning
url	https://arxiv.org/abs/2510.20413

Similar Items