:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Yang; Li, Yawei; Brown, Hannah; Rezaei, Mina; Bischl, Bernd; Torr, Philip; Khakzar, Ashkan; Kawaguchi, Kenji
Format:	Preprint
Published:	2023
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2310.06514

Similar Items

A Dual-Perspective Approach to Evaluating Feature Attribution Methods
by: Li, Yawei, et al.
Published: (2023)

FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models
by: Zhang, Yang, et al.
Published: (2024)

Calibrating LLMs with Information-Theoretic Evidential Deep Learning
by: Li, Yawei, et al.
Published: (2025)

On Discprecncies between Perturbation Evaluations of Graph Neural Network Attributions
by: Rezaei, Razieh, et al.
Published: (2024)

RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching
by: Jafari, Farnoush Rezaei, et al.
Published: (2025)

Minimalist Concept Erasure in Generative Models
by: Zhang, Yang, et al.
Published: (2025)

How Visual Representations Map to Language Feature Space in Multimodal LLMs
by: Venhoff, Constantin, et al.
Published: (2025)

Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
by: Lan, Michael, et al.
Published: (2024)

Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval
by: Venhoff, Constantin, et al.
Published: (2025)

Structured Credal Learning
by: Venkatesh, Varun, et al.
Published: (2026)

What Is Fairness? On the Role of Protected Attributes and Fictitious Worlds
by: Bothmann, Ludwig, et al.
Published: (2022)

Learnable Sparsity for Vision Generative Models
by: Zhang, Yang, et al.
Published: (2024)

Towards Understanding Multimodal Fine-Tuning: Spatial Features
by: Naghashyar, Lachin, et al.
Published: (2026)

Uncertainty-aware Pseudo-label Selection for Positive-Unlabeled Learning
by: Dorigatti, Emilio, et al.
Published: (2022)

Latent Guard: a Safety Framework for Text-to-image Generation
by: Liu, Runtao, et al.
Published: (2024)

Learning Visual Prompts for Guiding the Attention of Vision Transformers
by: Rezaei, Razieh, et al.
Published: (2024)

Mixture of Experts Made Intrinsically Interpretable
by: Yang, Xingyi, et al.
Published: (2025)

The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity
by: Caraker, Drake, et al.
Published: (2026)

Articulate3D: Zero-Shot Text-Driven 3D Object Posing
by: Deb, Oishi, et al.
Published: (2025)

Efficient Document Embeddings via Self-Contrastive Bregman Divergence Learning
by: Saggau, Daniel, et al.
Published: (2023)

Normalized AOPC: Fixing Misleading Faithfulness Metrics for Feature Attribution Explainability
by: Edin, Joakim, et al.
Published: (2024)

AlignGuard: Scalable Safety Alignment for Text-to-Image Generation
by: Liu, Runtao, et al.
Published: (2024)

Ideal Attribution and Faithful Watermarks for Language Models
by: Song, Min Jae, et al.
Published: (2025)

Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information
by: Huang, Xin, et al.
Published: (2026)

On the Robustness of Global Feature Effect Explanations
by: Baniecki, Hubert, et al.
Published: (2024)

Incorporating Attribution Importance for Improving Faithfulness Metrics
by: Zhao, Zhixue, et al.
Published: (2023)

Beyond Output Faithfulness: Learning Attributions that Preserve Computational Pathways
by: Zhang, Siyu, et al.
Published: (2025)

Single Character Perturbations Break LLM Alignment
by: Lin, Leon, et al.
Published: (2024)

The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms
by: Davies, Adam, et al.
Published: (2024)

The Attribution Contract: Feature Attribution for Generative Language Models
by: Nguyen, Giang
Published: (2026)

Self-Evaluation as a Defense Against Adversarial Attacks on LLMs
by: Brown, Hannah, et al.
Published: (2024)

Analyzing Error Sources in Global Feature Effect Estimation
by: Heiß, Timo, et al.
Published: (2026)

Post-hoc Orthogonalization for Mitigation of Protected Feature Bias in CXR Embeddings
by: Weber, Tobias, et al.
Published: (2023)

Decomposing Global Feature Effects Based on Feature Interactions
by: Herbinger, Julia, et al.
Published: (2023)

ABE: A Unified Framework for Robust and Faithful Attribution-Based Explainability
by: Zhu, Zhiyu, et al.
Published: (2025)

Wrapper Boxes: Faithful Attribution of Model Predictions to Training Data
by: Su, Yiheng, et al.
Published: (2023)

Impossibility Theorems for Feature Attribution
by: Bilodeau, Blair, et al.
Published: (2022)

How Faithful Is Trajectory-Based Data Attribution? Error Sources, Remedies, and Practical Guidelines
by: Deng, Junwei, et al.
Published: (2026)

Feature Attribution from First Principles
by: Taimeskhanov, Magamed, et al.
Published: (2025)

Disentangling Interactions and Dependencies in Feature Attribution
by: König, Gunnar, et al.
Published: (2024)