Library Catalog

Bibliographic Details
Main Authors:	Gauderis, Ward; Dooms, Thomas; Holmer, Steven T.; Ayonrinde, Kola; Wiggins, Geraint A.
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.08934

Similar Items

Compositionality Unlocks Deep Interpretable Models
by: Dooms, Thomas, et al.
Published: (2025)

Bilinear autoencoders find interpretable manifolds
by: Dooms, Thomas, et al.
Published: (2026)

Finding Manifolds With Bilinear Autoencoders
by: Dooms, Thomas, et al.
Published: (2025)

A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
by: Ayonrinde, Kola, et al.
Published: (2025)

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
by: Ayonrinde, Kola, et al.
Published: (2025)

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders
by: Ayonrinde, Kola
Published: (2024)

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
by: Ayonrinde, Kola, et al.
Published: (2024)

Quantum Methods for Managing Ambiguity in Natural Language Processing
by: Eisinger, Jurek, et al.
Published: (2025)

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
by: Gonzalez, ML Nissen, et al.
Published: (2026)

BioOSS: A Bio-Inspired Oscillatory State System with Spatio-Temporal Dynamics
by: Yuan, Zhongju, et al.
Published: (2025)

Tokenized SAEs: Disentangling SAE Reconstructions
by: Dooms, Thomas, et al.
Published: (2025)

Towards a Formal Creativity Theory: Preliminary results in Novelty and Transformativeness
by: Santo, Luís Espírito, et al.
Published: (2024)

A novel Reservoir Architecture for Periodic Time Series Prediction
by: Yuan, Zhongju, et al.
Published: (2024)

Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems
by: Dooms, Ann
Published: (2026)

Weight-based Decomposition: A Case for Bilinear MLPs
by: Pearce, Michael T., et al.
Published: (2024)

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
by: Karvonen, Adam, et al.
Published: (2025)

Bilinear MLPs enable weight-based mechanistic interpretability
by: Pearce, Michael T., et al.
Published: (2024)

Exemplar Partitioning for Mechanistic Interpretability
by: Rumbelow, Jessica
Published: (2026)

Open Problems in Mechanistic Interpretability
by: Sharkey, Lee, et al.
Published: (2025)

Mechanistic Interpretability for Neural TSP Solvers
by: Narad, Reuben, et al.
Published: (2025)

Mechanistic Interpretability of Reinforcement Learning Agents
by: Trim, Tristan, et al.
Published: (2024)

Validating Mechanistic Interpretations: An Axiomatic Approach
by: Palumbo, Nils, et al.
Published: (2024)

Mechanistic Interpretability for Transformer-based Time Series Classification
by: Kalnāre, Matīss, et al.
Published: (2025)

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
by: Sutter, Denis, et al.
Published: (2025)

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
by: Gupta, Rohan, et al.
Published: (2024)

Neural Network-Based Piecewise Survival Models
by: Holmer, Olov, et al.
Published: (2024)

Usage-Specific Survival Modeling Based on Operational Data and Neural Networks
by: Holmer, Olov, et al.
Published: (2024)

Geospatial Mechanistic Interpretability of Large Language Models
by: De Sabbata, Stef, et al.
Published: (2025)

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
by: Bushnaq, Lucius, et al.
Published: (2024)

Challenges in Mechanistically Interpreting Model Representations
by: Golechha, Satvik, et al.
Published: (2024)

Mechanistic Interpretability of Binary and Ternary Transformers
by: Li, Jason
Published: (2024)

Bridging Mechanistic Interpretability and Prompt Engineering with Gradient Ascent for Interpretable Persona Control
by: Saini, Harshvardhan, et al.
Published: (2026)

Compact Proofs of Model Performance via Mechanistic Interpretability
by: Gross, Jason, et al.
Published: (2024)

Mechanistic Interpretability of RNNs emulating Hidden Markov Models
by: Torre, Elia, et al.
Published: (2025)

Interpretable Deep Learning for Polar Mechanistic Reaction Prediction
by: Miller, Ryan J., et al.
Published: (2025)

Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
by: Winninger, Thomas, et al.
Published: (2025)

MINAR: Mechanistic Interpretability for Neural Algorithmic Reasoning
by: He, Jesse, et al.
Published: (2026)

Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability
by: Long, Yanan
Published: (2025)

MIB: A Mechanistic Interpretability Benchmark
by: Mueller, Aaron, et al.
Published: (2025)

Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders
by: Erdogan, Ege, et al.
Published: (2025)