Bibliographic Details
Main Authors: Afonin, Nikita; Andriianov, Nikita; Hovhannisyan, Vahagn; Bageshpura, Nikhil; Liu, Kyle; Zhu, Kevin; Dev, Sunishchal; Panda, Ashwinee; Rogov, Oleg; Tutubalina, Elena; Panchenko, Alexander; Seleznyov, Mikhail
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2510.11288
Abstract:
  • Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across four model families (Gemini, Kimi-K2, Grok, and Qwen), narrow in-context examples cause models to produce misaligned responses to benign, unrelated queries. With 16 in-context examples, EM rates range from 1% to 24% depending on model and domain, and EM appears with as few as 2 examples. Neither larger model scale nor explicit reasoning provides reliable protection, and larger models are typically even more susceptible. We then formulate and test a hypothesis that explains in-context EM as a conflict between safety objectives and context-following behavior. Consistent with this hypothesis, instructing models to prioritize safety reduces EM, while instructing them to prioritize context-following increases it. These findings establish ICL as a previously underappreciated vector for emergent misalignment that resists simple scaling-based solutions.