Staff View: Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Chenyang; Zhao, Qingyue; Gu, Quanquan; Cao, Yuan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2603.22801

_version_	1866917359368798208
author	Zhang, Chenyang; Zhao, Qingyue; Gu, Quanquan; Cao, Yuan
author_facet	Zhang, Chenyang; Zhao, Qingyue; Gu, Quanquan; Cao, Yuan
contents	Transformers have achieved great success across a wide range of applications, yet the theoretical foundations underlying their success remain largely unexplored. To demystify the strong capacities of transformers applied to versatile scenarios and tasks, we theoretically investigate utilizing transformers as students to learn from a class of teacher models. Specifically, the teacher models covered in our analysis include convolution layers with average pooling, graph convolution layers, and various classic statistical learning models, including a variant of sparse token selection models [Sanford et al., 2023, Wang et al., 2024] and group-sparse linear predictors [Zhang et al., 2025]. When learning from this class of teacher models, we prove that one-layer transformers with simplified "position-only" attention can successfully recover all parameter blocks of the teacher models, thus achieving the optimal population loss. Building upon the efficient mimicry of trained transformers towards teacher models, we further demonstrate that they can generalize well to a broad class of out-of-distribution data under mild assumptions. The key in our analysis is to identify a fundamental bilinear structure shared by various learning tasks, which enables us to establish unified learning guarantees for these tasks when treating them as teachers for transformers.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_22801
institution	arXiv
publishDate	2026
record_format	arxiv
title	Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models
topic	Machine Learning
url	https://arxiv.org/abs/2603.22801

Similar Items