Bibliographic Details
Main Authors:	Gelada, Carles; Buckman, Jacob; Zhang, Sean; Bach, Txus
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2507.04239

_version_	1866912468487372800
author	Gelada, Carles; Buckman, Jacob; Zhang, Sean; Bach, Txus
author_facet	Gelada, Carles; Buckman, Jacob; Zhang, Sean; Bach, Txus
contents	We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: processing the context is too expensive in the former, too inexpensive in the latter. Approaches such as sliding window attention, which reduce the cost-per-token of a transformer, impair in-context learning and so are also unsuitable. To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks. Our experiments on the in-context learning of power attention show that these models dominate both exponential attention and linear attention at long-context training.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_04239
institution	arXiv
publishDate	2025
record_format	arxiv
title	Scaling Context Requires Rethinking Attention
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2507.04239

Similar Items