| Main Authors: | Arturi, Daniel Aarao Reis, Zhang, Eric, Ansah, Andrew, Zhu, Kevin, Panda, Ashwinee, Balwani, Aishwarya |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.02022 |
Similar Items
LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation
by: Zhang, Juzheng, et al.
Published: (2025)
Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment
by: Arnold, Julian, et al.
Published: (2025)
Convergent Linear Representations of Emergent Misalignment
by: Soligo, Anna, et al.
Published: (2025)
A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization
by: Panda, Ashwinee, et al.
Published: (2022)
Model Organisms for Emergent Misalignment
by: Turner, Edward, et al.
Published: (2025)
Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
by: Egbuna, Nathan, et al.
Published: (2025)
Speculating Experts Accelerates Inference for Mixture-of-Experts
by: Madan, Vivan, et al.
Published: (2026)
Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs
by: Chaturvedi, Isha, et al.
Published: (2025)
Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits
by: Patel, Dev, et al.
Published: (2025)
Understanding Emergent Misalignment via Feature Superposition Geometry
by: Minegishi, Gouki, et al.
Published: (2026)
In-Training Defenses against Emergent Misalignment in Language Models
by: Kaczér, David, et al.
Published: (2025)
FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness
by: Swaroop, Anand, et al.
Published: (2025)
Persona-Model Collapse in Emergent Misalignment
by: Costa, Davi Bastos, et al.
Published: (2026)
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
by: Chua, James, et al.
Published: (2025)
Privacy Auditing of Large Language Models
by: Panda, Ashwinee, et al.
Published: (2025)
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
by: Panda, Ashwinee, et al.
Published: (2025)
From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs
by: Mushtaq, Erum, et al.
Published: (2025)
Persona Features Control Emergent Misalignment
by: Wang, Miles, et al.
Published: (2025)
Shared Lexical Task Representations Explain Behavioral Variability In LLMs
by: Yang, Zhuonan, et al.
Published: (2026)
SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought
by: Batra, Shourya, et al.
Published: (2025)
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
by: Askin, Baris, et al.
Published: (2026)
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by: Betley, Jan, et al.
Published: (2025)
Teach LLMs to Phish: Stealing Private Information from Language Models
by: Panda, Ashwinee, et al.
Published: (2024)
The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs
by: Dickson, Craig
Published: (2025)
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
by: Balasubramanian, Rishab, et al.
Published: (2026)
Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic
by: Sommariva, Thomas, et al.
Published: (2026)
Gemstones: A Model Suite for Multi-Faceted Scaling Laws
by: McLeish, Sean, et al.
Published: (2025)
Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences
by: El, Batu, et al.
Published: (2025)
Emergent Misalignment is Easy, Narrow Misalignment is Hard
by: Soligo, Anna, et al.
Published: (2026)
Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs
by: Giordani, Jeremiah
Published: (2025)
ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use
by: Tien, Jeremy, et al.
Published: (2026)
On the Emergence of Cross-Task Linearity in the Pretraining-Finetuning Paradigm
by: Zhou, Zhanpeng, et al.
Published: (2024)
QMP: Q-switch Mixture of Policies for Multi-Task Behavior Sharing
by: Zhang, Grace, et al.
Published: (2023)
PASs-MoE: Mitigating Misaligned Co-drift among Router and Experts via Pathway Activation Subspaces for Continual Learning
by: Hou, Zhiyan, et al.
Published: (2026)
BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking
by: Ustaomeroglu, Muhammed, et al.
Published: (2026)
CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing
by: Yang, Chen, et al.
Published: (2024)
Overtrained, Not Misaligned
by: Schreiber, Joel, et al.
Published: (2026)
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
by: Pach, Mateusz, et al.
Published: (2026)
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
by: Mistretta, Marco, et al.
Published: (2025)