| Main authors: | Liu, Hude; Hu, Jerry Yao-Chieh; Song, Zhao; Liu, Han |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Artificial Intelligence |
| Online access: | https://arxiv.org/abs/2504.19901 |

| _version_ | 1866912350576050176 |
|---|---|
| author | Liu, Hude; Hu, Jerry Yao-Chieh; Song, Zhao; Liu, Han |
| author_facet | Liu, Hude; Hu, Jerry Yao-Chieh; Song, Zhao; Liu, Han |
| contents | We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input domain-partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights such that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, is capable of approximating any continuous function on a compact domain under the $L_\infty$-norm. Furthermore, we extend this construction to approximate any Lebesgue integrable function under the $L_p$-norm for $1 \leq p < \infty$. Lastly, we also extend our techniques and show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2504_19901 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Attention Mechanism, Max-Affine Partition, and Universal Approximation |
| topic | Machine Learning; Artificial Intelligence |
| url | https://arxiv.org/abs/2504.19901 |
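
The abstract's reading of single-head attention as a domain-partition mechanism can be illustrated with a toy one-dimensional sketch. This is not the authors' construction (it omits, for instance, the sum-of-linear preprocessing); the function names `softmax` and `attention_approx`, the target $\sin(2\pi x)$, the cell count, and the inverse temperature `beta` are all illustrative assumptions. Each key contributes a score that is affine in the query $x$, chosen so that the largest score belongs to the nearest cell center. A sharp softmax then places nearly all weight on that cell's stored value, giving a piecewise-constant approximation of the target.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # all exponents are <= 0
    total = sum(exps)
    return [e / total for e in exps]


def attention_approx(x, target, n_cells=50, beta=1e5):
    """Toy single-head attention on [0, 1] (illustrative assumption only).

    Query: (x, 1). Key i: (c_i, -c_i^2 / 2). Value i: target(c_i).
    The score c_i * x - c_i^2 / 2 is affine in x and equals
    -(x - c_i)^2 / 2 up to a term independent of i, so the argmax
    over i is the cell center nearest to x (a max-affine partition).
    """
    centers = [(i + 0.5) / n_cells for i in range(n_cells)]
    scores = [beta * (c * x - 0.5 * c * c) for c in centers]
    weights = softmax(scores)
    # Attention output: convex combination of the per-cell values.
    return sum(w * target(c) for w, c in zip(weights, centers))


def target(x):
    """Hypothetical continuous target function on [0, 1]."""
    return math.sin(2 * math.pi * x)


grid = [k / 1000 for k in range(1001)]
max_err = max(abs(attention_approx(x, target) - target(x)) for x in grid)
print(f"sup error on grid: {max_err:.4f}")
```

In this sketch the error is governed by how much the target varies within one cell, so increasing `n_cells` tightens the approximation, provided `beta` is raised as well so that the softmax stays sharp relative to the cell width.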