Staff View: Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models :: Library Catalog

Bibliographic Details
Main Authors:	Demir, Samet; Dogan, Zafer
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2509.15152

_version_	1866912592874700800
author	Demir, Samet; Dogan, Zafer
author_facet	Demir, Samet; Dogan, Zafer
contents	We study the in-context learning (ICL) capabilities of pretrained Transformers in the setting of nonlinear regression. Specifically, we focus on a random Transformer with a nonlinear MLP head where the first layer is randomly initialized and fixed while the second layer is trained. Furthermore, we consider an asymptotic regime where the context length, input dimension, hidden dimension, number of training tasks, and number of training samples jointly grow. In this setting, we show that the random Transformer behaves equivalently to a finite-degree Hermite polynomial model in terms of ICL error. This equivalence is validated through simulations across varying activation functions, context lengths, hidden layer widths (revealing a double-descent phenomenon), and regularization settings. Our results offer theoretical and empirical insights into when and how MLP layers enhance ICL, and how nonlinearity and over-parameterization influence model performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_15152
institution	arXiv
publishDate	2025
record_format	arxiv
title	Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models
topic	Machine Learning
url	https://arxiv.org/abs/2509.15152
