Bibliographic Details
Main Authors: Chen, Zhuotong; Liu, Fang; Zhu, Xuan; Qi, Yanjun; Ghavamzadeh, Mohammad
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2502.04567
author Chen, Zhuotong
Liu, Fang
Zhu, Xuan
Qi, Yanjun
Ghavamzadeh, Mohammad
contents Existing studies on preference optimization (PO) have centered on constructing pairwise preference data following simple heuristics, such as maximizing the margin between preferred and dispreferred completions based on human (or AI) ranked scores. However, none of these heuristics has a full theoretical justification. In this work, we develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose to estimate its normalization constant via a sampling strategy. As we will demonstrate, these estimative samples can act as dispreferred completions in PO. We then select contrastive divergence (CD) as the sampling strategy, and propose a novel MC-PO algorithm that applies the Monte Carlo (MC) kernel from CD to sample hard negatives w.r.t. the parameterized reward model. Finally, we propose the OnMC-PO algorithm, an extension of MC-PO to the online setting. On popular alignment benchmarks, MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
format Preprint
id arxiv_https___arxiv_org_abs_2502_04567
institution arXiv
publishDate 2025
record_format arxiv
title Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator
topic Artificial Intelligence
url https://arxiv.org/abs/2502.04567