Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Aljaafari, Nura, Carvalho, Danilo S., Freitas, André
Format:	Preprint
Veröffentlicht:	2025
Schlagworte:	Computation and Language
Online-Zugang:	https://arxiv.org/abs/2507.03668
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866908434494914560
author	Aljaafari, Nura Carvalho, Danilo S. Freitas, André
author_facet	Aljaafari, Nura Carvalho, Danilo S. Freitas, André
contents	Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including features probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals overlooked by traditional scalar metrics such as loss or accuracy. With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_03668
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	TRACE: Training and Inference-Time Interpretability Analysis for Language Models Aljaafari, Nura Carvalho, Danilo S. Freitas, André Computation and Language Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including features probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals overlooked by traditional scalar metrics such as loss or accuracy. With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.
title	TRACE: Training and Inference-Time Interpretability Analysis for Language Models
topic	Computation and Language
url	https://arxiv.org/abs/2507.03668

Ähnliche Einträge