Bibliographic Details
Main Author: Ye, Donald
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2602.01442
_version_ 1866917473731739648
author Ye, Donald
author_facet Ye, Donald
contents Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evaluate this assumption across two algorithmic tasks and up to 10 random seeds, uncovering a systematic, layer-wise failure: gradient attribution consistently overvalues early-layer Gradient Bloats and undervalues late-layer Hidden Heroes. Rank correlation collapses from $\rho = 0.72$ on sequence reversal to $0.27$ on sequence sorting, reaching $\rho = -0.18$ in individual seeds. This failure stems from first-order gradient attribution's inability to detect collective redundancy: joint Bloat ablation causes $14\times$ greater damage than individual results predict. Consequently, Bloats dominate gradient rankings despite negligible functional impact, while ablating Hidden Heroes destroys OOD accuracy ($-36.4\% \pm 22.8\%$). This systematic inversion of early-layer feature extraction and late-layer computation motivates causal validation as a prerequisite for circuit-level claims.
format Preprint
id arxiv_https___arxiv_org_abs_2602_01442
institution arXiv
publishDate 2026
record_format arxiv
title Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers
topic Machine Learning
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2602.01442
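The abstract relies on two quantities: a rank correlation between gradient-attribution scores and causal (ablation) importance, and a comparison of joint ablation damage against what individual ablations predict. The sketch below illustrates how such quantities could be computed. It is not the author's code: the function names, the choice of Spearman correlation, the definition of the ratio as joint damage divided by the sum of individual damages, and all numbers are assumptions made for illustration.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def redundancy_ratio(joint_drop, individual_drops):
    """Joint ablation damage relative to the sum of individual damages.

    A ratio well above 1 suggests the components are collectively redundant:
    each looks unimportant alone, but removing them together hurts.
    """
    return joint_drop / sum(individual_drops)


# Hypothetical per-component scores (made-up numbers, not from the paper).
grad_attribution = [0.9, 0.8, 0.7, 0.2, 0.1]      # gradient-based importance
ablation_drop = [0.01, 0.02, 0.03, 0.40, 0.35]    # accuracy lost when ablated

rho = spearman(grad_attribution, ablation_drop)
ratio = redundancy_ratio(joint_drop=0.42, individual_drops=[0.01, 0.02])
print(f"rank correlation: {rho:.2f}, joint/individual ratio: {ratio:.1f}")
```

In this toy data the components ranked highest by gradient attribution cause the least damage when ablated, so the correlation is strongly negative, the kind of inversion the abstract describes for individual seeds.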