| Main authors: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published in: | Zenodo, 2025 |
| Online access: | https://doi.org/10.5281/zenodo.17815654 |
Summary:
- The opaque nature of large language models (LLMs) presents a significant challenge to their widespread adoption in critical applications. While traditional interpretability methods offer insights into model behavior, they often fall short in identifying the underlying causal mechanisms driving predictions, instead focusing on correlations or feature importance. This paper introduces a novel masking paradigm designed to unmask and explicitly identify causal representations within pre-trained language models, thereby enhancing their explainability and trustworthiness. Our approach posits that by systematically intervening on and observing the resulting changes in model output, we can isolate textual segments or latent features that act as direct causal factors for a given prediction. This paradigm involves a sequence of structured masking operations coupled with counterfactual analysis, allowing for the distinction between merely correlated input features and truly causal ones. We demonstrate how this method can be applied to dissect model decisions across various natural language processing tasks, revealing not only which parts of the input are most influential but also how they causally contribute to the final outcome. Through extensive experimentation on established benchmarks, we show that this novel masking paradigm outperforms existing explainability techniques in terms of fidelity to the model's true causal reasoning and provides more actionable, human-interpretable explanations. The proposed framework represents a significant step towards developing truly explainable and robust AI systems, fostering greater transparency and accountability in the deployment of advanced language models.
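
The intervention idea described in the summary (mask part of the input, then observe how the output changes) can be illustrated with a minimal sketch. This is a generic single-token occlusion loop, not the paper's method: the record does not specify the actual masking sequence or the counterfactual analysis. The names `toy_model`, `masking_effects`, and the `[MASK]` placeholder are hypothetical, and the "model" is a hand-written scoring function standing in for a real pre-trained language model.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"  # hypothetical placeholder token, an assumption for this sketch


def toy_model(tokens: List[str]) -> float:
    """Hypothetical stand-in for a language model: returns a sentiment score."""
    weights = {"excellent": 2.0, "good": 1.0, "terrible": -2.0}
    return sum(weights.get(tok, 0.0) for tok in tokens)


def masking_effects(
    tokens: List[str],
    model: Callable[[List[str]], float],
    mask: str = MASK,
) -> List[Tuple[str, float]]:
    """Mask one token at a time and record the change in the model's output.

    The effect of a token is the baseline score minus the score obtained
    when that token is replaced by the mask placeholder.
    """
    baseline = model(tokens)
    effects = []
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        effects.append((tok, baseline - model(masked)))
    return effects


if __name__ == "__main__":
    sentence = "the food was excellent".split()
    for tok, effect in masking_effects(sentence, toy_model):
        print(f"{tok:>10s}  {effect:+.1f}")
```

A token whose removal leaves the output unchanged has no measured effect under this intervention. Single-token masking alone cannot separate causal from merely correlated features, which is presumably why the summary pairs masking with structured sequences of interventions and counterfactual analysis.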