Staff View: Bootstrapping Language Models with DPO Implicit Rewards :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Changyu; Liu, Zichen; Du, Chao; Pang, Tianyu; Liu, Qian; Sinha, Arunesh; Varakantham, Pradeep; Lin, Min
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2406.09760

_version_	1866910863070330880
author	Chen, Changyu; Liu, Zichen; Du, Chao; Pang, Tianyu; Liu, Qian; Sinha, Arunesh; Varakantham, Pradeep; Lin, Min
author_facet	Chen, Changyu; Liu, Zichen; Du, Chao; Pang, Tianyu; Liu, Qian; Sinha, Arunesh; Varakantham, Pradeep; Lin, Min
contents	Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO, after training, provides an implicit reward model. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach is to use the rewards from a current LLM to construct a preference dataset, which is then used in subsequent DPO rounds. We incorporate two refinements to further improve our approach: 1) length-regularized reward shaping to make the preference dataset length-unbiased; 2) experience replay to enhance the quality of the preference dataset. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment. It achieves an increase of more than 8% in length-controlled win rate on AlpacaEval 2 for all the different base models that we tried, without relying on external feedback. Our code is available at https://github.com/sail-sg/dice.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_09760
institution	arXiv
publishDate	2024
record_format	arxiv
title	Bootstrapping Language Models with DPO Implicit Rewards
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2406.09760
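
Illustrative note on the abstract: the preference-construction step it describes can be sketched roughly as follows. This is a minimal sketch and not the authors' implementation (their code is at the repository linked in the abstract). It assumes the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), and it assumes that length regularization is a linear per-token penalty with a hypothetical coefficient `alpha` and that the highest- and lowest-scoring samples form the pair. The `Candidate` type and all function names are made up for illustration, and experience replay is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One sampled response to a prompt (hypothetical container)."""
    text: str
    policy_logp: float  # log pi_theta(y | x), summed over response tokens
    ref_logp: float     # log pi_ref(y | x), summed over response tokens
    num_tokens: int     # response length in tokens


def implicit_reward(c: Candidate, beta: float) -> float:
    # Standard DPO implicit reward, up to a prompt-dependent constant
    # that cancels when comparing responses to the same prompt.
    return beta * (c.policy_logp - c.ref_logp)


def length_regularized_reward(c: Candidate, beta: float, alpha: float) -> float:
    # ASSUMPTION: a linear length penalty; alpha is a hypothetical knob.
    return implicit_reward(c, beta) - alpha * c.num_tokens


def build_preference_pair(
    candidates: List[Candidate], beta: float, alpha: float
) -> Tuple[Candidate, Candidate]:
    """Return (chosen, rejected) for one prompt.

    ASSUMPTION: the best- and worst-scoring samples form the pair.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(
        candidates, key=lambda c: length_regularized_reward(c, beta, alpha)
    )
    return ranked[-1], ranked[0]


if __name__ == "__main__":
    samples = [
        Candidate("a", policy_logp=-10.0, ref_logp=-12.0, num_tokens=20),
        Candidate("b", policy_logp=-30.0, ref_logp=-33.0, num_tokens=80),
        Candidate("c", policy_logp=-15.0, ref_logp=-14.0, num_tokens=25),
    ]
    chosen, rejected = build_preference_pair(samples, beta=0.1, alpha=0.01)
    print("chosen:", chosen.text, "rejected:", rejected.text)
```

With `alpha=0` the longest sample in the toy data scores highest on the raw implicit reward; a positive `alpha` shifts the choice toward a shorter response, which is the kind of length bias the abstract's first refinement targets. The resulting pairs would then feed the next DPO round.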

Similar Items