Bibliographic Details
Main Authors: Venkatesh, Sohan; Kurapath, Ashish Mahendran
Format: Preprint
Published: 2026
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.06801
author Venkatesh, Sohan
Kurapath, Ashish Mahendran
contents Activation steering methods are widely used to control large language model (LLM) behavior and are often interpreted as revealing meaningful internal representations. This interpretation assumes that steering directions are identifiable and uniquely recoverable from input-output behavior. We show that, under white-box single-layer access, steering vectors are fundamentally non-identifiable due to large equivalence classes of behaviorally indistinguishable interventions. Empirically, we find that orthogonal perturbations achieve near-equivalent efficacy with negligible effect sizes across multiple models and traits, with pre-trained semantic classifiers confirming equivalence at the output level. We estimate null-space dimensionality via SVD of activation covariance matrices and validate that equivalence holds robustly throughout the operationally relevant steering range. Critically, we show that non-identifiability is a robust geometric property that persists across diverse prompt distributions. These findings reveal fundamental interpretability limits and highlight the need for structural constraints beyond behavioral testing to enable reliable alignment interventions.
format Preprint
id arxiv_https___arxiv_org_abs_2602_06801
institution arXiv
publishDate 2026
record_format arxiv
title On the Non-Identifiability of Steering Vectors in Large Language Models
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2602.06801
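
The abstract above mentions estimating null-space dimensionality via SVD of activation covariance matrices. The following is a minimal sketch of that idea on synthetic data, not the authors' procedure: the function name `null_space_basis`, the relative tolerance, the synthetic activations, and the linear `readout` map (built here so that it depends only on the occupied subspace) are all assumptions made for illustration.

```python
import numpy as np


def null_space_basis(activations, rel_tol=1e-8):
    """Estimate the near-null space of the activation covariance.

    activations: array of shape (n_samples, d).
    Returns (basis, dim), where basis has shape (d, dim) with orthonormal columns.
    The tolerance is an illustrative choice, not a value from the paper.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (activations.shape[0] - 1)
    # cov is symmetric positive semi-definite; singular values come back sorted descending.
    u, s, _ = np.linalg.svd(cov)
    rank = int(np.sum(s > rel_tol * s[0]))
    basis = u[:, rank:]  # directions carrying (numerically) no variance
    return basis, basis.shape[1]


rng = np.random.default_rng(0)
d, k, n = 64, 16, 500

# Synthetic "activations" confined to a k-dimensional subspace of R^d (assumption).
mixing = rng.standard_normal((k, d))
acts = rng.standard_normal((n, k)) @ mixing

basis, null_dim = null_space_basis(acts)

# Hypothetical downstream map that only reads the occupied subspace (assumption).
readout = rng.standard_normal((8, k)) @ mixing

v = rng.standard_normal(d)       # stand-in steering vector
v_alt = v + 5.0 * basis[:, 0]    # same vector plus a large null-space component

h = acts[0]
out_a = readout @ (h + v)
out_b = readout @ (h + v_alt)
max_gap = float(np.max(np.abs(out_a - out_b)))

print("estimated null-space dimension:", null_dim)
print("distance between steering vectors:", np.linalg.norm(v_alt - v))
print("largest output difference:", max_gap)
```

In this toy setting the two steering vectors differ by a vector of norm 5 yet produce numerically matching readouts, which is the kind of behavioral equivalence the abstract describes. A real model is nonlinear, so this only illustrates the geometric intuition under the stated assumptions.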