Staff View: PIPA: Preference Alignment as Prior-Informed Statistical Estimation :: Library Catalog

Bibliographic Details
Main Authors:	Li, Junbo; Wang, Zhangyang; Liu, Qiang
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.05773

_version_	1866912499984498688
author	Li, Junbo; Wang, Zhangyang; Liu, Qiang
author_facet	Li, Junbo; Wang, Zhangyang; Liu, Qiang
contents	Offline preference alignment for language models such as Direct Preference Optimization (DPO) is favored for its effectiveness and simplicity, eliminating the need for costly reinforcement learning. Various offline algorithms have been developed for different data settings, yet they lack a unified understanding. In this study, we introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework that formulates language model preference alignment as a Maximum Likelihood Estimation (MLE) problem with prior constraints. This method effectively accommodates both paired and unpaired data, as well as answer and step-level annotations. We illustrate that DPO and KTO are special cases with different prior constraints within our framework. By integrating different types of prior information, we developed two variations of PIPA: PIPA-M and PIPA-N. Both algorithms demonstrate a $3\sim10\%$ performance enhancement on the GSM8K and MATH benchmarks across all configurations, achieving these gains without additional training or computational costs compared to existing algorithms.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_05773
institution	arXiv
publishDate	2025
record_format	arxiv
title	PIPA: Preference Alignment as Prior-Informed Statistical Estimation
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2502.05773

Similar Items