Bibliographic Details
Main Authors: Chen, Zheng-An, Luo, Tao
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2510.06954
_version_ 1866916997221056512
author Chen, Zheng-An
Luo, Tao
author_facet Chen, Zheng-An
Luo, Tao
contents Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics are inadequately characterized beyond configuration-specific studies. Inspired by empirical evidence showing improved reasoning capabilities under small initialization scales in language models, we employ the gradient flow analytical framework established in [Zhou et al. NeurIPS 2022] to systematically investigate linearized Transformer training dynamics. Our theoretical analysis dissects the dynamics of attention modules into two distinct stages. In the first stage, asymmetric weight perturbations from random initialization sustain non-degenerate gradient dynamics in parameter matrices, facilitating systematic escape from small initialization regimes. Subsequently, these matrices undergo condensation, progressively aligning toward the target orientation. In the second stage, the previously static key-query matrices actively participate in training, driving the normalized matrices toward asymptotic rank collapse. This two-stage framework generalizes classical directional convergence results.
format Preprint
id arxiv_https___arxiv_org_abs_2510_06954
institution arXiv
publishDate 2025
record_format arxiv
title From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
topic Machine Learning
url https://arxiv.org/abs/2510.06954