Staff View: On the Non-Identifiability of Steering Vectors in Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Venkatesh, Sohan; Kurapath, Ashish Mahendran
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.06801

_version_	1866911559259783168
author	Venkatesh, Sohan; Kurapath, Ashish Mahendran
author_facet	Venkatesh, Sohan; Kurapath, Ashish Mahendran
contents	Activation steering methods are widely used to control large language model (LLM) behavior and are often interpreted as revealing meaningful internal representations. This interpretation assumes that steering directions are identifiable and uniquely recoverable from input-output behavior. We show that, under white-box single-layer access, steering vectors are fundamentally non-identifiable due to large equivalence classes of behaviorally indistinguishable interventions. Empirically, we find that orthogonal perturbations achieve near-equivalent efficacy with negligible effect sizes across multiple models and traits, with pre-trained semantic classifiers confirming equivalence at the output level. We estimate null-space dimensionality via SVD of activation covariance matrices and validate that equivalence holds robustly throughout the operationally relevant steering range. Critically, we show that non-identifiability is a robust geometric property that persists across diverse prompt distributions. These findings reveal fundamental interpretability limits and highlight the need for structural constraints beyond behavioral testing to enable reliable alignment interventions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_06801
institution	arXiv
publishDate	2026
record_format	arxiv
title	On the Non-Identifiability of Steering Vectors in Large Language Models
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2602.06801

Similar Items