Bibliographic Details
Main Authors: Eichin, Florian; Du, Yupei; Mondorf, Philipp; Matveev, Maria; Plank, Barbara; Hedderich, Michael A.
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2505.20076
_version_ 1866914068883832832
author Eichin, Florian
Du, Yupei
Mondorf, Philipp
Matveev, Maria
Plank, Barbara
Hedderich, Michael A.
contents Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, such approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all these perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to realistic settings like AdamW. We empirically validate that a CNN and a Transformer are accurately replicated by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. Their effectiveness for parameter pruning is comparable to existing methods, demonstrating their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.
format Preprint
id arxiv_https___arxiv_org_abs_2505_20076
institution arXiv
publishDate 2025
record_format arxiv
title ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior
topic Machine Learning
url https://arxiv.org/abs/2505.20076
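The abstract describes gradient path kernels as a way to rewrite a model trained by gradient descent as a kernel machine. The following is a minimal sketch of that general idea only, not the ExPLAIND method: it assumes a linear model, a mean squared error loss, and plain full-batch gradient descent (not AdamW), a setting in which the decomposition is exact. The function name train_and_decompose and all parameters are hypothetical and chosen for illustration.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def train_and_decompose(X, y, x_test, lr=0.05, steps=200, seed=0):
    """Train f(x) = w . x with full-batch gradient descent on
    L = (1 / 2n) * sum_i (f(x_i) - y_i)^2 and, alongside training,
    accumulate each training point's contribution to the change of
    the prediction at x_test (a path-kernel view of the model).

    Illustrative assumption: for a linear model, grad_w f(x) = x, so the
    tangent kernel between x_test and x_i is simply x_test . x_i.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(d)]
    f0 = dot(w, x_test)        # prediction at initialization
    influence = [0.0] * n      # per-training-point contributions

    for _ in range(steps):
        # dL/df(x_i) at the current parameters
        loss_grads = [(dot(w, xi) - yi) / n for xi, yi in zip(X, y)]

        # Kernel-machine term for this step:
        # -lr * dL/df(x_i) * K(x_test, x_i)
        for i, (xi, g) in enumerate(zip(X, loss_grads)):
            influence[i] += -lr * g * dot(x_test, xi)

        # Ordinary gradient descent update of the parameters
        grad_w = [sum(g * xi[j] for xi, g in zip(X, loss_grads))
                  for j in range(d)]
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]

    return dot(w, x_test), f0, influence


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [1.0, 2.0, 3.0]
    pred, f0, influence = train_and_decompose(X, y, x_test=[2.0, -1.0])
    print("trained prediction:       ", pred)
    print("kernel-machine prediction:", f0 + sum(influence))
    print("per-example contributions:", influence)
```

In this toy setting the trained prediction and the initial prediction plus the summed contributions agree up to floating-point error, and the individual entries of influence indicate how much each training point moved the prediction. For non-linear models or adaptive optimizers such as AdamW, the kernel changes along the training path and a rewrite of this kind needs the more careful treatment the abstract refers to.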