Bibliographic Details
Main Authors: Wang, Yue; Wang, Qizhou; Zhang, Zizhuo; Niu, Gang; Han, Bo; Sugiyama, Masashi
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2512.00778
author Wang, Yue
Wang, Qizhou
Zhang, Zizhuo
Niu, Gang
Han, Bo
Sugiyama, Masashi
contents Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses of the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and explaining their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO balances exploration and exploitation, validating the common belief from a new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, three key yet seldom discussed components of PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the targets, whereas loss reweighting acts less as a reward signal and more as a regularizer that mitigates overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets, while loss reweighting, which is related to the absolute advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics affects optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
format Preprint
id arxiv_https___arxiv_org_abs_2512_00778
institution arXiv
publishDate 2025
record_format arxiv
title What Is Preference Optimization Doing, and Why?
topic Machine Learning
url https://arxiv.org/abs/2512.00778