Bibliographic Details
Main Authors: Chen, Changyu, Liu, Zichen, Du, Chao, Pang, Tianyu, Liu, Qian, Sinha, Arunesh, Varakantham, Pradeep, Lin, Min
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2406.09760
author Chen, Changyu
Liu, Zichen
Du, Chao
Pang, Tianyu
Liu, Qian
Sinha, Arunesh
Varakantham, Pradeep
Lin, Min
contents Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO, after training, provides an implicit reward model. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach is to use the rewards from a current LLM to construct a preference dataset, which is then used in subsequent DPO rounds. We incorporate two refinements to further improve our approach: 1) length-regularized reward shaping to make the preference dataset length-unbiased; 2) experience replay to enhance the quality of the preference dataset. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment. It achieves an increase of more than 8% in length-controlled win rate on AlpacaEval 2 for all the different base models that we tried, without relying on external feedback. Our code is available at https://github.com/sail-sg/dice.
format Preprint
id arxiv_https___arxiv_org_abs_2406_09760
institution arXiv
publishDate 2024
record_format arxiv
title Bootstrapping Language Models with DPO Implicit Rewards
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2406.09760
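The bootstrapping loop summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository). It assumes the standard DPO implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), and assumes length regularization takes the form of a penalty alpha * |y| on the response length; the abstract does not give the exact form. The names Candidate, build_pair, and mix_with_replay, the best-versus-worst pairing rule, and the values of beta, alpha, and replay_fraction are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled response with precomputed sequence log-probabilities."""
    text: str
    logp_policy: float  # log pi(y|x) under the current DPO-tuned model
    logp_ref: float     # log pi_ref(y|x) under the reference model
    num_tokens: int     # response length |y|


def implicit_reward(c: Candidate, beta: float = 0.1) -> float:
    # Standard DPO implicit reward, up to a prompt-dependent constant.
    return beta * (c.logp_policy - c.logp_ref)


def length_regularized_reward(c: Candidate, beta: float = 0.1,
                              alpha: float = 0.01) -> float:
    # Assumed shaping: subtract a penalty proportional to response length.
    return implicit_reward(c, beta) - alpha * c.num_tokens


def build_pair(prompt: str, candidates: list[Candidate],
               beta: float = 0.1, alpha: float = 0.01) -> dict:
    # Hypothetical pairing rule: highest shaped reward is "chosen",
    # lowest shaped reward is "rejected".
    ranked = sorted(
        candidates,
        key=lambda c: length_regularized_reward(c, beta, alpha),
    )
    return {"prompt": prompt,
            "chosen": ranked[-1].text,
            "rejected": ranked[0].text}


def mix_with_replay(generated: list[dict], offline: list[dict],
                    replay_fraction: float = 0.5, seed: int = 0) -> list[dict]:
    # Experience replay, sketched as mixing self-generated pairs with a
    # sample of the original offline preference pairs.
    rng = random.Random(seed)
    n_replay = min(len(offline), int(round(replay_fraction * len(generated))))
    mixed = list(generated) + rng.sample(offline, n_replay)
    rng.shuffle(mixed)
    return mixed


candidates = [
    Candidate("short answer", logp_policy=-10.0, logp_ref=-12.0, num_tokens=20),
    Candidate("long answer", logp_policy=-30.0, logp_ref=-35.0, num_tokens=80),
    Candidate("weak answer", logp_policy=-15.0, logp_ref=-14.0, num_tokens=10),
]
pair = build_pair("example prompt", candidates)
```

In this toy input the long response has the highest raw implicit reward, but the length penalty moves it to the bottom of the ranking, which is the kind of length bias the shaping is meant to counteract. The resulting pairs, mixed with replayed offline data, would then feed the next DPO round.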