Bibliographic Details
Main Authors: Afonin, Nikita, Andriianov, Nikita, Hovhannisyan, Vahagn, Bageshpura, Nikhil, Liu, Kyle, Zhu, Kevin, Dev, Sunishchal, Panda, Ashwinee, Rogov, Oleg, Tutubalina, Elena, Panchenko, Alexander, Seleznyov, Mikhail
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2510.11288
author Afonin, Nikita
Andriianov, Nikita
Hovhannisyan, Vahagn
Bageshpura, Nikhil
Liu, Kyle
Zhu, Kevin
Dev, Sunishchal
Panda, Ashwinee
Rogov, Oleg
Tutubalina, Elena
Panchenko, Alexander
Seleznyov, Mikhail
contents Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across four model families (Gemini, Kimi-K2, Grok, and Qwen), narrow in-context examples cause models to produce misaligned responses to benign, unrelated queries. With 16 in-context examples, EM rates range from 1% to 24% depending on model and domain, appearing with as few as 2 examples. Neither larger model scale nor explicit reasoning provides reliable protection, and larger models are typically even more susceptible. Next, we formulate and test a hypothesis that explains in-context EM as a conflict between safety objectives and context-following behavior. Consistent with this, instructing models to prioritize safety reduces EM while prioritizing context-following increases it. These findings establish ICL as a previously underappreciated vector for emergent misalignment that resists simple scaling-based solutions.
format Preprint
id arxiv_https___arxiv_org_abs_2510_11288
institution arXiv
publishDate 2025
record_format arxiv
title Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
topic Computation and Language
url https://arxiv.org/abs/2510.11288