Bibliographic Details
Main Authors: Lin, Zezheng, Liu, Fengming
Format: Preprint
Published: 2026
Subjects: Machine Learning; Artificial Intelligence; Computation and Language
Online Access: https://arxiv.org/abs/2605.08012
_version_ 1866915994148012032
author Lin, Zezheng
Liu, Fengming
author_facet Lin, Zezheng
Liu, Fengming
contents Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on $n=30$ reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.
format Preprint
id arxiv_https___arxiv_org_abs_2605_08012
institution arXiv
publishDate 2026
record_format arxiv
title Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims
topic Machine Learning
Artificial Intelligence
Computation and Language
68T07
I.2.6; I.2.0
url https://arxiv.org/abs/2605.08012