Bibliographic Details
Main Authors: Guo, Han; Yang, Songlin; Goel, Tarushii; Xing, Eric P.; Dao, Tri; Kim, Yoon
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2506.04761
_version_ 1866908857597427712
author Guo, Han
Yang, Songlin
Goel, Tarushii
Xing, Eric P.
Dao, Tri
Kim, Yoon
contents The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. However, its quadratic compute and linear memory complexity remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can, moreover, be trained efficiently through matmul-rich parallelization across the sequence length. At their core, however, these models are still RNNs, so their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances the efficiency of linear attention and the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that, with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures, Mamba-2 and Gated DeltaNet, and find that they perform well compared to their linear-time counterparts.
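The contrast the abstract draws between a single fixed-size hidden state and a logarithmically growing set of states can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names, the binary (Fenwick-tree-style) bucketing used as the growth function, and the `weight` parameter are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention as an RNN with one fixed-size state S (d_k x d_v)."""
    T, d_k = Q.shape
    S = np.zeros((d_k, V.shape[1]))
    out = np.zeros((T, V.shape[1]))
    for t in range(T):
        S += np.outer(K[t], V[t])  # state size does not depend on t
        out[t] = Q[t] @ S
    return out

def linear_attention_parallel(Q, K, V):
    """Same result via a causally masked matmul (quadratic; used only as a check)."""
    mask = np.tril(np.ones((Q.shape[0], Q.shape[0])))
    return ((Q @ K.T) * mask) @ V

def binary_buckets(t):
    """Assumed growth function: split the prefix [0, t) into power-of-two-sized
    buckets, one per set bit of t, with the most recent tokens in the first bucket."""
    buckets = []
    while t > 0:
        low = t & -t               # lowest set bit of t
        buckets.append((t - low, t))
        t -= low
    return buckets

def bucketed_attention(Q, K, V, weight=lambda level: 1.0):
    """One state per bucket, so at most floor(log2(t)) + 1 states per position.
    `weight` is a hypothetical per-bucket scale; with unit weights this reduces
    to plain linear attention. States are recomputed naively for clarity."""
    T = Q.shape[0]
    out = np.zeros((T, V.shape[1]))
    for t in range(T):
        for level, (a, b) in enumerate(binary_buckets(t + 1)):
            S = K[a:b].T @ V[a:b]  # per-bucket state, shape (d_k, d_v)
            out[t] += weight(level) * (Q[t] @ S)
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 16, 4))
print(np.allclose(linear_attention_recurrent(Q, K, V), linear_attention_parallel(Q, K, V)))
print(np.allclose(bucketed_attention(Q, K, V), linear_attention_recurrent(Q, K, V)))
print(binary_buckets(13))
```

With unit weights the bucketed form only regroups the same sum, so it matches linear attention exactly. Any gain in expressiveness would have to come from weighting the buckets differently, which this sketch leaves as a placeholder.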
format Preprint
id arxiv_https___arxiv_org_abs_2506_04761
institution arXiv
publishDate 2025
record_format arxiv
title Log-Linear Attention
topic Machine Learning
url https://arxiv.org/abs/2506.04761