
Bibliographic Details
Main Authors:	Shen, Yaojie; Wang, Xinyao; Niu, Yulei; Zhou, Ying; Tang, Lexin; Zhang, Libo; Chen, Fan; Wen, Longyin
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2409.08845

_version_	1866929499591933952
author	Shen, Yaojie; Wang, Xinyao; Niu, Yulei; Zhou, Ying; Tang, Lexin; Zhang, Libo; Chen, Fan; Wen, Longyin
author_facet	Shen, Yaojie; Wang, Xinyao; Niu, Yulei; Zhou, Ying; Tang, Lexin; Zhang, Libo; Chen, Fan; Wen, Longyin
contents	Preference Optimization (PO) is gaining popularity as an alternative to Proximal Policy Optimization (PPO) for aligning Large Language Models (LLMs). Recent research on aligning LLMs iteratively with synthetic or partially synthetic data shows promising results in scaling up PO training, both in academic settings and for proprietary trained models such as Llama3. Despite this success, our study shows that the length exploitation issue present in PO is even more severe in Iterative Preference Optimization (IPO) due to the iterative nature of the process. In this work, we study iterative preference optimization with synthetic data and share the findings and analysis from building the iterative preference optimization pipeline. More specifically, we discuss the length exploitation issue during iterative preference optimization and propose a training objective for it, namely Agreement-aware Iterative Preference Optimization (AIPO). To demonstrate the effectiveness of our method, we conduct comprehensive experiments and achieve state-of-the-art performance on MT-Bench, AlpacaEval 2.0, and Arena-Hard. Our implementation and model checkpoints will be made available at https://github.com/bytedance/AIPO.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_08845
institution	arXiv
publishDate	2024
record_format	arxiv
title	AIPO: Improving Training Objective for Iterative Preference Optimization
topic	Computation and Language
url	https://arxiv.org/abs/2409.08845
