| Main Authors: | Carles Gelada, Jacob Buckman, Sean Zhang, Txus Bach |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2507.04239 |
| _version_ | 1866912468487372800 |
|---|---|
| author | Gelada, Carles; Buckman, Jacob; Zhang, Sean; Bach, Txus |
| author_facet | Gelada, Carles; Buckman, Jacob; Zhang, Sean; Bach, Txus |
| contents | We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: the cost of processing the context is too high in the former and too low in the latter. Approaches such as sliding window attention, which reduce the cost-per-token of a transformer, impair in-context learning and so are also unsuitable. To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks. Our experiments on the in-context learning of power attention show that these models dominate both exponential attention and linear attention at long-context training. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2507_04239 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Scaling Context Requires Rethinking Attention |
| topic | Machine Learning; Artificial Intelligence |
| url | https://arxiv.org/abs/2507.04239 |