Bibliographic Details
Main Authors: Liu, Hude; Hu, Jerry Yao-Chieh; Song, Zhao; Liu, Han
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2504.19901
_version_ 1866912350576050176
author Liu, Hude
Hu, Jerry Yao-Chieh
Song, Zhao
Liu, Han
contents We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input domain-partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights such that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, is capable of approximating any continuous function on a compact domain under the $L_\infty$-norm. Furthermore, we extend this construction to approximate any Lebesgue integrable function under $L_p$-norm for $1\leq p <\infty$. Lastly, we also extend our techniques and show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees.
format Preprint
id arxiv_https___arxiv_org_abs_2504_19901
institution arXiv
publishDate 2025
record_format arxiv
title Attention Mechanism, Max-Affine Partition, and Universal Approximation
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2504.19901
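
The abstract reads single-head attention as a mechanism that partitions the input domain and assigns a value to each subregion. The following is a minimal toy sketch of that idea in one dimension, not the construction from the paper. Everything in it is an assumption made for illustration: the interval [0, 1], the evenly spaced centers, the affine scores `c*x - c*c/2` (whose argmax selects the nearest center), the temperature `beta`, and the sine target.

```python
import math


def softmax(scores):
    # Numerically stable softmax: shift by the maximum before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_approx(x, centers, values, beta):
    # One affine score per "key". Since c*x - c^2/2 = -(x - c)^2/2 + x^2/2,
    # the largest score belongs to the center nearest to x, so the maximum
    # of these affine functions partitions [0, 1] into intervals.
    scores = [beta * (c * x - 0.5 * c * c) for c in centers]
    weights = softmax(scores)
    # A large beta makes the softmax nearly one-hot, so the output is
    # close to the value attached to the region containing x.
    return sum(w * v for w, v in zip(weights, values))


def sup_error(target, n_centers, beta, n_test=1001):
    # Hypothetical setup: evenly spaced centers, each carrying the target's
    # value at that center. Returns the largest error on a test grid.
    centers = [(i + 0.5) / n_centers for i in range(n_centers)]
    values = [target(c) for c in centers]
    xs = [j / (n_test - 1) for j in range(n_test)]
    return max(
        abs(attention_approx(x, centers, values, beta) - target(x)) for x in xs
    )


def target(x):
    # Illustrative continuous target on the compact domain [0, 1].
    return math.sin(2 * math.pi * x)


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, sup_error(target, n, beta=1e5))
```

In this sketch the grid error shrinks as the number of regions grows, which mirrors the sup-norm statement in the abstract only at the level of intuition. The paper's actual construction and its guarantees are given in the preprint itself.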