:: Library Catalog

Bibliographic Details
Main Authors:	Arturi, Daniel Aarao Reis; Zhang, Eric; Ansah, Andrew; Zhu, Kevin; Panda, Ashwinee; Balwani, Aishwarya
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.02022

Similar Items

LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation
by: Zhang, Juzheng, et al.
Published: (2025)

Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment
by: Arnold, Julian, et al.
Published: (2025)

Convergent Linear Representations of Emergent Misalignment
by: Soligo, Anna, et al.
Published: (2025)

A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization
by: Panda, Ashwinee, et al.
Published: (2022)

Model Organisms for Emergent Misalignment
by: Turner, Edward, et al.
Published: (2025)

Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
by: Egbuna, Nathan, et al.
Published: (2025)

Speculating Experts Accelerates Inference for Mixture-of-Experts
by: Madan, Vivan, et al.
Published: (2026)

Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs
by: Chaturvedi, Isha, et al.
Published: (2025)

Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits
by: Patel, Dev, et al.
Published: (2025)

Understanding Emergent Misalignment via Feature Superposition Geometry
by: Minegishi, Gouki, et al.
Published: (2026)

In-Training Defenses against Emergent Misalignment in Language Models
by: Kaczér, David, et al.
Published: (2025)

FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness
by: Swaroop, Anand, et al.
Published: (2025)

Persona-Model Collapse in Emergent Misalignment
by: Costa, Davi Bastos, et al.
Published: (2026)

CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
by: Chua, James, et al.
Published: (2025)

Privacy Auditing of Large Language Models
by: Panda, Ashwinee, et al.
Published: (2025)

Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
by: Panda, Ashwinee, et al.
Published: (2025)

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs
by: Mushtaq, Erum, et al.
Published: (2025)

Persona Features Control Emergent Misalignment
by: Wang, Miles, et al.
Published: (2025)

Shared Lexical Task Representations Explain Behavioral Variability In LLMs
by: Yang, Zhuonan, et al.
Published: (2026)

SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought
by: Batra, Shourya, et al.
Published: (2025)

Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
by: Askin, Baris, et al.
Published: (2026)

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
by: Betley, Jan, et al.
Published: (2025)

Teach LLMs to Phish: Stealing Private Information from Language Models
by: Panda, Ashwinee, et al.
Published: (2024)

The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs
by: Dickson, Craig
Published: (2025)

The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
by: Balasubramanian, Rishab, et al.
Published: (2026)

Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic
by: Sommariva, Thomas, et al.
Published: (2026)

Gemstones: A Model Suite for Multi-Faceted Scaling Laws
by: McLeish, Sean, et al.
Published: (2025)

Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences
by: El, Batu, et al.
Published: (2025)

Emergent Misalignment is Easy, Narrow Misalignment is Hard
by: Soligo, Anna, et al.
Published: (2026)

Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs
by: Giordani, Jeremiah
Published: (2025)

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use
by: Tien, Jeremy, et al.
Published: (2026)

On the Emergence of Cross-Task Linearity in the Pretraining-Finetuning Paradigm
by: Zhou, Zhanpeng, et al.
Published: (2024)

QMP: Q-switch Mixture of Policies for Multi-Task Behavior Sharing
by: Zhang, Grace, et al.
Published: (2023)

PASs-MoE: Mitigating Misaligned Co-drift among Router and Experts via Pathway Activation Subspaces for Continual Learning
by: Hou, Zhiyan, et al.
Published: (2026)

BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking
by: Ustaomeroglu, Muhammed, et al.
Published: (2026)

CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing
by: Yang, Chen, et al.
Published: (2024)

Overtrained, Not Misaligned
by: Schreiber, Joel, et al.
Published: (2026)

The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
by: Pach, Mateusz, et al.
Published: (2026)

Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
by: Mistretta, Marco, et al.
Published: (2025)