Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Jørgenvåg, Magnus; Kaczér, David; Ruttert, Lasse; Gülhan, Marvin; Flek, Lucie; Mai, Florian
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2605.31328

_version_	1866916066221883392
author	Jørgenvåg, Magnus; Kaczér, David; Ruttert, Lasse; Gülhan, Marvin; Flek, Lucie; Mai, Florian
author_facet	Jørgenvåg, Magnus; Kaczér, David; Ruttert, Lasse; Gülhan, Marvin; Flek, Lucie; Mai, Florian
contents	Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setting, evidence that it also arises from reinforcement learning (RL) is limited to large, closed-source models, leaving the phenomenon expensive to study and difficult to reproduce. We characterize EM from RL in small, off-the-shelf open-weight models along three axes. First, we show that rewarding narrow, overtly misaligned behavior produces substantially higher general-domain misalignment than sample-matched SFT. Second, we show that EM from RL can be induced by reward signals that could plausibly arise naturally, such as unpopular aesthetic preferences or poor rhetorical appeals. Third, we evaluate in-training mitigations developed for SFT-induced EM and find that they broadly transfer, with interleaving on-policy safety data performing best.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_31328
institution	arXiv
publishDate	2026
record_format	arxiv
title	Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards
topic	Computation and Language
url	https://arxiv.org/abs/2605.31328

Similar Items