Bibliographic details
Main authors: Houliston, Sam, Pace, Alizée, Immer, Alexander, Rätsch, Gunnar
Format: Preprint
Published: 2024
Subjects:
Online access: https://arxiv.org/abs/2410.20187
_version_ 1866916456092925952
author Houliston, Sam
Pace, Alizée
Immer, Alexander
Rätsch, Gunnar
author_facet Houliston, Sam
Pace, Alizée
Immer, Alexander
Rätsch, Gunnar
contents Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to the issue of proxy reward overoptimization. Analysis of the DPO loss reveals a critical need for regularization for mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT2 Medium on the Anthropic-HH dataset using a model ensemble to obtain uncertainty estimates, and shows improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
format Preprint
id arxiv_https___arxiv_org_abs_2410_20187
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Uncertainty-Penalized Direct Preference Optimization
Houliston, Sam
Pace, Alizée
Immer, Alexander
Rätsch, Gunnar
Machine Learning
Artificial Intelligence
Learning and adaptive systems in artificial intelligence
title Uncertainty-Penalized Direct Preference Optimization
topic Machine Learning
Artificial Intelligence
Learning and adaptive systems in artificial intelligence
url https://arxiv.org/abs/2410.20187