Bibliographic Details
Main Authors: Lee, Andrew; Bai, Xiaoyan; Pres, Itamar; Wattenberg, Martin; Kummerfeld, Jonathan K.; Mihalcea, Rada
Format: Preprint
Published: 2024
Subjects: Computation and Language; Artificial Intelligence
Online Access: https://arxiv.org/abs/2401.01967
_version_ 1866916080676503552
author Lee, Andrew
Bai, Xiaoyan
Pres, Itamar
Wattenberg, Martin
Kummerfeld, Jonathan K.
Mihalcea, Rada
contents While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms by which models become "aligned", making it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Namely, we first study how toxicity is represented and elicited in a pre-trained language model, GPT2-medium. We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting model averts toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the model, reverting it back to its toxic behavior.
format Preprint
id arxiv_https___arxiv_org_abs_2401_01967
institution arXiv
publishDate 2024
record_format arxiv
title A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2401.01967