Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Wang, Yue, Wang, Qizhou, Zhang, Zizhuo, Niu, Gang, Han, Bo, Sugiyama, Masashi
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2512.00778
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866910222990180352
author	Wang, Yue Wang, Qizhou Zhang, Zizhuo Niu, Gang Han, Bo Sugiyama, Masashi
author_facet	Wang, Yue Wang, Qizhou Zhang, Zizhuo Niu, Gang Han, Bo Sugiyama, Masashi
contents	Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses for the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and comprehending their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO balances exploration and exploitation, validating the common belief yet from this new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, which are three key yet seldom discussed components within PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the targets. However, loss reweighting in DPO acts less as a reward signal but more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets. Meanwhile, loss reweighting, related to the absolute advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_00778
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	What Is Preference Optimization Doing, and Why? Wang, Yue Wang, Qizhou Zhang, Zizhuo Niu, Gang Han, Bo Sugiyama, Masashi Machine Learning Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses for the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and comprehending their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO balances exploration and exploitation, validating the common belief yet from this new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, which are three key yet seldom discussed components within PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the targets. However, loss reweighting in DPO acts less as a reward signal but more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets. Meanwhile, loss reweighting, related to the absolute advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
title	What Is Preference Optimization Doing, and Why?
topic	Machine Learning
url	https://arxiv.org/abs/2512.00778

Similar Items