Vista Equipo: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autores principales:	Weckauff, Anietta, Zhang, Yuchen, Andriushchenko, Maksym
Formato:	Preprint
Publicado:	2026
Materias:	Artificial Intelligence
Acceso en línea:	https://arxiv.org/abs/2604.28082
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

_version_	1866909005147799552
author	Weckauff, Anietta Zhang, Yuchen Andriushchenko, Maksym
author_facet	Weckauff, Anietta Zhang, Yuchen Andriushchenko, Maksym
contents	Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation between harmful behavior and self-assessment in emergently misaligned models, it remains unclear how consistent this correspondence is across tasks and whether it varies across fine-tuning domains. We characterize the consistency of the EM persona by fine-tuning Qwen 2.5 32B Instruct on six narrowly misaligned domains (e.g., insecure code, risky financial advice, bad medical advice) and administering experiments including harmfulness evaluation, self-assessment, choosing between two descriptions of AI systems, output recognition, and score prediction. Our results reveal two distinct patterns: coherent-persona models, in which harmful behavior and self-reported misalignment are coupled, and inverted-persona models, which produce harmful outputs while identifying as aligned AI systems. These findings reveal a more fine-grained picture of the effects of emergent misalignment, calling into question the consistency of the EM persona.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_28082
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Characterizing the Consistency of the Emergent Misalignment Persona Weckauff, Anietta Zhang, Yuchen Andriushchenko, Maksym Artificial Intelligence Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation between harmful behavior and self-assessment in emergently misaligned models, it remains unclear how consistent this correspondence is across tasks and whether it varies across fine-tuning domains. We characterize the consistency of the EM persona by fine-tuning Qwen 2.5 32B Instruct on six narrowly misaligned domains (e.g., insecure code, risky financial advice, bad medical advice) and administering experiments including harmfulness evaluation, self-assessment, choosing between two descriptions of AI systems, output recognition, and score prediction. Our results reveal two distinct patterns: coherent-persona models, in which harmful behavior and self-reported misalignment are coupled, and inverted-persona models, which produce harmful outputs while identifying as aligned AI systems. These findings reveal a more fine-grained picture of the effects of emergent misalignment, calling into question the consistency of the EM persona.
title	Characterizing the Consistency of the Emergent Misalignment Persona
topic	Artificial Intelligence
url	https://arxiv.org/abs/2604.28082

Ejemplares similares