MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Houliston, Sam; Pace, Alizée; Immer, Alexander; Rätsch, Gunnar
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence; Learning and adaptive systems in artificial intelligence
Online Access:	https://arxiv.org/abs/2410.20187

_version_	1866916456092925952
author	Houliston, Sam; Pace, Alizée; Immer, Alexander; Rätsch, Gunnar
author_facet	Houliston, Sam; Pace, Alizée; Immer, Alexander; Rätsch, Gunnar
contents	Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to the issue of proxy reward overoptimization. Analysis of the DPO loss reveals a critical need for regularization for mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT2 Medium on the Anthropic-HH dataset using a model ensemble to obtain uncertainty estimates, and shows improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_20187
institution	arXiv
publishDate	2024
record_format	arxiv
title	Uncertainty-Penalized Direct Preference Optimization
topic	Machine Learning; Artificial Intelligence; Learning and adaptive systems in artificial intelligence
url	https://arxiv.org/abs/2410.20187