:: Library Catalog

Bibliographic Details
Main Authors:	Golimblevskaia, Elena; Jain, Aakriti; Puri, Bruno; Ibrahim, Ammar; Samek, Wojciech; Lapuschkin, Sebastian
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2510.14936

Similar Items

FADE: Why Bad Descriptions Happen to Good Features
by: Puri, Bruno, et al.
Published: (2025)

Atlas-Alignment: Making Interpretability Transferable Across Language Models
by: Puri, Bruno, et al.
Published: (2025)

AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers
by: Achtibat, Reduan, et al.
Published: (2024)

Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs
by: Hatefi, Sayed Mohammad Vakilzadeh, et al.
Published: (2025)

Ensuring Medical AI Safety: Interpretability-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data
by: Pahde, Frederik, et al.
Published: (2025)

PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits
by: Dreyer, Maximilian, et al.
Published: (2024)

Iterative Inference in a Chess-Playing Neural Network
by: Sandmann, Elias, et al.
Published: (2025)

Post-Hoc Concept Disentanglement: From Correlated to Isolated Concept Representations
by: Erogullari, Eren, et al.
Published: (2025)

Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond
by: Bareeva, Dilyara, et al.
Published: (2024)

Explaining Predictive Uncertainty by Exposing Second-Order Effects
by: Bley, Florian, et al.
Published: (2024)

Human-Centered Evaluation of XAI Methods
by: Dawoud, Karam, et al.
Published: (2023)

Reactive Model Correction: Mitigating Harm to Task-Relevant Features via Conditional Bias Suppression
by: Bareeva, Dilyara, et al.
Published: (2024)

Attribution-Guided Decoding
by: Komorowski, Piotr, et al.
Published: (2025)

ECQ$^{\text{x}}$: Explainability-Driven Quantization for Low-Bit and Sparse DNNs
by: Becking, Daniel, et al.
Published: (2021)

Building Trust in PINNs: Error Estimation through Finite Difference Methods
by: Krasowski, Aleksander, et al.
Published: (2026)

Navigating Neural Space: Revisiting Concept Activation Vectors to Overcome Directional Divergence
by: Pahde, Frederik, et al.
Published: (2022)

Sparse, Efficient and Explainable Data Attribution with DualXDA
by: Yolcu, Galip Ümit, et al.
Published: (2024)

A Close Look at Decomposition-based XAI-Methods for Transformer Language Models
by: Arras, Leila, et al.
Published: (2025)

From Attribution to Action: A Human-Centered Application of Activation Steering
by: Labarta, Tobias, et al.
Published: (2026)

Judge Circuits
by: Feldhus, Nils, et al.
Published: (2026)

The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation
by: Kahardipraja, Patrick, et al.
Published: (2025)

Relevance-driven Input Dropout: an Explanation-guided Regularization Technique
by: Gururaj, Shreyas, et al.
Published: (2025)

From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance
by: Dreyer, Maximilian, et al.
Published: (2025)

Pruning By Explaining Revisited: Optimizing Attribution Methods to Prune CNNs and Transformers
by: Hatefi, Sayed Mohammad Vakilzadeh, et al.
Published: (2024)

Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
by: Ahmad, Areeb, et al.
Published: (2025)

From Attribution Maps to Human-Understandable Explanations through Concept Relevance Propagation
by: Achtibat, Reduan, et al.
Published: (2022)

LieSolver: A PDE-constrained solver for IBVPs using Lie symmetries
by: Klausen, René P., et al.
Published: (2025)

Mechanistic understanding and validation of large AI models with SemanticLens
by: Dreyer, Maximilian, et al.
Published: (2025)

Efficient and Flexible Neural Network Training through Layer-wise Feedback Propagation
by: Weber, Leander, et al.
Published: (2023)

PINNfluence: Influence Functions for Physics-Informed Neural Networks
by: Naujoks, Jonas R., et al.
Published: (2024)

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
by: Lan, Michael, et al.
Published: (2023)

Leveraging Influence Functions for Resampling Data in Physics-Informed Neural Networks
by: Naujoks, Jonas R., et al.
Published: (2025)

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
by: Puri, Isha, et al.
Published: (2026)

CoSy: Evaluating Textual Explanations of Neurons
by: Kopf, Laura, et al.
Published: (2024)

Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
by: Damani, Mehul, et al.
Published: (2025)

Dissecting Persona-Driven Reasoning in Language Models via Activation Patching
by: Poonia, Ansh, et al.
Published: (2025)

Understanding the (Extra-)Ordinary: Validating Deep Model Decisions with Prototypical Concept-based Explanations
by: Dreyer, Maximilian, et al.
Published: (2023)

Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression
by: Xiao, Hanqi, et al.
Published: (2025)

Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference
by: Kim, Geonhee, et al.
Published: (2024)

Model Science: getting serious about verification, explanation and control of AI systems
by: Biecek, Przemyslaw, et al.
Published: (2025)