Staff View: From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Zheng-An; Luo, Tao
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2510.06954

_version_	1866916997221056512
author	Chen, Zheng-An; Luo, Tao
author_facet	Chen, Zheng-An; Luo, Tao
contents	Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics are inadequately characterized beyond configuration-specific studies. Inspired by empirical evidence showing improved reasoning capabilities under small initialization scales in language models, we employ the gradient flow analytical framework established in [Zhou et al. NeurIPS 2022] to systematically investigate linearized Transformer training dynamics. Our theoretical analysis dissects the dynamics of attention modules into two distinct stages. In the first stage, asymmetric weight perturbations from random initialization sustain non-degenerate gradient dynamics in parameter matrices, facilitating systematic escape from small initialization regimes. Subsequently, these matrices undergo condensation, progressively aligning toward the target orientation. In the second stage, the previously static key-query matrices actively participate in training, driving the normalized matrices toward asymptotic rank collapse. This two-stage framework generalizes classical directional convergence results.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_06954
institution	arXiv
publishDate	2025
record_format	arxiv
title	From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
topic	Machine Learning
url	https://arxiv.org/abs/2510.06954
