Staff View: Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior :: Library Catalog

Bibliographic Details
Main Authors:	Arturi, Daniel Aarao Reis; Zhang, Eric; Ansah, Andrew; Zhu, Kevin; Panda, Ashwinee; Balwani, Aishwarya
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.02022

_version_	1866914133564194816
author	Arturi, Daniel Aarao Reis; Zhang, Eric; Ansah, Andrew; Zhu, Kevin; Panda, Ashwinee; Balwani, Aishwarya
author_facet	Arturi, Daniel Aarao Reis; Zhang, Eric; Ansah, Andrew; Zhu, Kevin; Panda, Ashwinee; Balwani, Aishwarya
contents	Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate that it exhibits a fundamental cross-task linear structure in how harmful behavior is encoded across different datasets. Specifically, we find a strong convergence in EM parameters across tasks, with the fine-tuned weight updates showing relatively high cosine similarities, as well as shared lower-dimensional subspaces as measured by their principal angles and projection overlaps. Furthermore, we also show functional equivalence via linear mode connectivity, wherein interpolated models across narrow misalignment tasks maintain coherent, broadly misaligned behavior. Our results indicate that EM arises from different narrow tasks discovering the same set of shared parameter directions, suggesting that harmful behaviors may be organized into specific, predictable regions of the weight landscape. By revealing this fundamental connection between parametric geometry and behavioral outcomes, we hope our work catalyzes further research on parameter space interpretability and weight-based interventions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_02022
institution	arXiv
publishDate	2025
record_format	arxiv
title	Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2511.02022
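
Note on the abstract: the geometric measures it names (cosine similarity between fine-tuned weight updates, principal angles and projection overlap between subspaces, and linear interpolation between models) can be sketched with NumPy as below. This is an illustrative sketch only, not the authors' code. The function names, the use of the top-k left singular vectors as the "subspace" of an update, and the normalization of the overlap by k are all assumptions made for this example.

```python
import numpy as np


def cosine_similarity(delta_a, delta_b):
    """Cosine similarity between two weight updates, flattened to vectors."""
    a, b = delta_a.ravel(), delta_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_left_subspace(delta, k):
    """Orthonormal basis for the top-k left singular directions of an update.

    Assumption: the 'subspace' of an update is taken from its SVD.
    """
    u, _, _ = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k]


def principal_angles(basis_a, basis_b):
    """Principal angles (radians) between two subspaces.

    Both inputs must have orthonormal columns. The singular values of
    basis_a.T @ basis_b are the cosines of the principal angles.
    """
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


def projection_overlap(basis_a, basis_b):
    """Normalized overlap in [0, 1]: 1 for identical subspaces, 0 for orthogonal.

    Assumption: overlap is ||A^T B||_F^2 divided by the subspace dimension k.
    """
    k = basis_a.shape[1]
    return float(np.linalg.norm(basis_a.T @ basis_b, "fro") ** 2 / k)


def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two parameter arrays (0 <= alpha <= 1)."""
    return (1.0 - alpha) * theta_a + alpha * theta_b
```

In a linear mode connectivity check, one would evaluate the model at parameters `interpolate(theta_a, theta_b, alpha)` for several values of `alpha` and compare its behavior against the two endpoints; the evaluation itself depends on the model and is not shown here.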

Similar Items