Bibliographic Details
Main Authors: Arturi, Daniel Aarao Reis; Zhang, Eric; Ansah, Andrew; Zhu, Kevin; Panda, Ashwinee; Balwani, Aishwarya
Format: Preprint
Published: 2025
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2511.02022
author Arturi, Daniel Aarao Reis
Zhang, Eric
Ansah, Andrew
Zhu, Kevin
Panda, Ashwinee
Balwani, Aishwarya
contents Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate that it exhibits a fundamental cross-task linear structure in how harmful behavior is encoded across different datasets. Specifically, we find a strong convergence in EM parameters across tasks, with the fine-tuned weight updates showing relatively high cosine similarities, as well as shared lower-dimensional subspaces as measured by their principal angles and projection overlaps. Furthermore, we also show functional equivalence via linear mode connectivity, wherein interpolated models across narrow misalignment tasks maintain coherent, broadly misaligned behavior. Our results indicate that EM arises from different narrow tasks discovering the same set of shared parameter directions, suggesting that harmful behaviors may be organized into specific, predictable regions of the weight landscape. By revealing this fundamental connection between parametric geometry and behavioral outcomes, we hope our work catalyzes further research on parameter space interpretability and weight-based interventions.
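The abstract compares fine-tuned weight updates by cosine similarity and tests linear mode connectivity by interpolating between models. The following is a minimal sketch of those two operations on toy vectors, not the authors' code. The function names (`delta`, `cosine_similarity`, `interpolate`) and the four-element parameter lists are hypothetical, chosen only for illustration. Real models would use flattened weight tensors, and the principal-angle and projection-overlap analyses are not shown here.

```python
import math

def delta(finetuned, base):
    """Weight update: fine-tuned parameters minus base parameters (flattened lists)."""
    return [f - b for f, b in zip(finetuned, base)]

def cosine_similarity(u, v):
    """Cosine of the angle between two update vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two parameter vectors (alpha in [0, 1])."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]

# Toy values (assumed): one base model and two models fine-tuned on different narrow tasks.
base = [0.0, 0.0, 0.0, 0.0]
task_a = [1.0, 2.0, 0.0, 0.5]
task_b = [0.9, 2.1, 0.1, 0.4]

sim = cosine_similarity(delta(task_a, base), delta(task_b, base))
midpoint = interpolate(task_a, task_b, 0.5)
print(f"cosine similarity of updates: {sim:.4f}")
print("midpoint parameters:", midpoint)
```

In a linear mode connectivity test, each interpolated parameter vector would be loaded back into the model and evaluated. The sketch stops at constructing the interpolated point.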
format Preprint
id arxiv_https___arxiv_org_abs_2511_02022
institution arXiv
publishDate 2025
record_format arxiv
title Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2511.02022