Staff View: A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity :: Library Catalog

Bibliographic Details
Main Authors:	Lee, Andrew; Bai, Xiaoyan; Pres, Itamar; Wattenberg, Martin; Kummerfeld, Jonathan K.; Mihalcea, Rada
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2401.01967

_version_	1866916080676503552
author	Lee, Andrew; Bai, Xiaoyan; Pres, Itamar; Wattenberg, Martin; Kummerfeld, Jonathan K.; Mihalcea, Rada
author_facet	Lee, Andrew; Bai, Xiaoyan; Pres, Itamar; Wattenberg, Martin; Kummerfeld, Jonathan K.; Mihalcea, Rada
contents	While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms by which models become "aligned", thus making it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Namely, we first study how toxicity is represented and elicited in a pre-trained language model, GPT2-medium. We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting model averts toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the model, reverting it back to its toxic behavior.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_01967
institution	arXiv
publishDate	2024
record_format	arxiv
title	A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2401.01967
