Staff View: Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers :: Library Catalog

Bibliographic Details
Main Author:	Ye, Donald
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2602.01442

_version_	1866917473731739648
author	Ye, Donald
author_facet	Ye, Donald
contents	Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evaluate this assumption across two algorithmic tasks and up to 10 random seeds, uncovering a systematic, layer-wise failure: gradient attribution consistently overvalues early-layer \textbf{Gradient Bloats} and undervalues late-layer \textbf{Hidden Heroes}. Rank correlation collapses from $\rho = 0.72$ on sequence reversal to $0.27$ on sequence sorting, reaching $\rho = -0.18$ in individual seeds. This failure stems from first-order gradient attribution's inability to detect collective redundancy: joint Bloat ablation causes $14\times$ greater damage than individual results predict. Consequently, Bloats dominate gradient rankings despite negligible functional impact, while ablating Hidden Heroes destroys OOD accuracy ($-36.4\% \pm 22.8\%$). This systematic inversion of early-layer feature extraction and late-layer computation motivates causal validation as a prerequisite for circuit-level claims.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_01442
institution	arXiv
publishDate	2026
record_format	arxiv
title	Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers
topic	Machine Learning; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2602.01442