Staff View: Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

Bibliographic Details
Main Authors:	Zhao, Mingkuan; Hu, Wentao; Wang, Jiayin; Lai, Xin; Huang, Tianchen; Min, Yuheng; Yan, Rui; Zhu, Xiaoyan
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; 68T50 (Primary); I.2.7
Online Access:	https://arxiv.org/abs/2511.09596

_version_	1866908678669467648
author	Zhao, Mingkuan; Hu, Wentao; Wang, Jiayin; Lai, Xin; Huang, Tianchen; Min, Yuheng; Yan, Rui; Zhu, Xiaoyan
author_facet	Zhao, Mingkuan; Hu, Wentao; Wang, Jiayin; Lai, Xin; Huang, Tianchen; Min, Yuheng; Yan, Rui; Zhu, Xiaoyan
contents	The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of O(H N^2) that grows quadratically with the context size (N) and linearly with the number of heads (H). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment. This approach transforms the multi-head attention mechanism from H independent O(N^2) computations into a single, collaborative O(N^2) computation, fundamentally reducing complexity by a factor of H. The structured inductive bias compels functional specialization among heads, enabling a more efficient allocation of computational resources from redundant modeling to distinct dependencies across the entire sequence span. Our work demonstrates that thoughtfully designed structural sparsity can serve as an effective inductive bias that simultaneously improves both computational efficiency and model performance, opening a new avenue for the architectural design of next-generation, high-performance LLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_09596
institution	arXiv
publishDate	2025
record_format	arxiv
title	Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off
topic	Machine Learning; 68T50 (Primary); I.2.7
url	https://arxiv.org/abs/2511.09596
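
Illustrative note on the abstract: the contents field describes partitioning the attention workload into balanced, non-overlapping distance bands, one per head. The abstract does not give the exact banding rule, so the following is only a minimal sketch under stated assumptions: attention is causal, "balanced" is read as "roughly equal numbers of query-key pairs per band", and the function names (band_boundaries, head_for_pair, pairs_per_head) are hypothetical and not taken from the paper.

```python
from bisect import bisect_right


def band_boundaries(n: int, num_heads: int) -> list[int]:
    """Cut causal distances 0..n-1 into num_heads contiguous bands.

    Assumption: a distance d = i - j occurs in (n - d) causal pairs, so the
    cuts are placed to give each band roughly total / num_heads pairs.
    Band h covers distances in [bounds[h], bounds[h + 1]).
    """
    total = n * (n + 1) // 2  # number of causal (query, key) pairs
    bounds = [0]
    cumulative = 0
    d = 0
    for h in range(1, num_heads):
        target = total * h / num_heads
        # Extend the current band while it stays within its share of pairs.
        while d < n and cumulative + (n - d) <= target:
            cumulative += n - d
            d += 1
        bounds.append(d)
    bounds.append(n)
    return bounds


def head_for_pair(i: int, j: int, bounds: list[int]) -> int:
    """Return the single head responsible for query i attending to key j <= i."""
    return bisect_right(bounds, i - j) - 1


def pairs_per_head(n: int, num_heads: int) -> list[int]:
    """Count how many causal pairs each head computes under this scheme."""
    bounds = band_boundaries(n, num_heads)
    counts = [0] * num_heads
    for i in range(n):
        for j in range(i + 1):
            counts[head_for_pair(i, j, bounds)] += 1
    return counts
```

Under these assumptions, a sequence of length 8 with 4 heads yields boundaries [0, 1, 2, 4, 8]. Every causal pair is assigned to exactly one head, so the per-head counts sum to the 36 pairs of a single causal attention map rather than 4 x 36, which is the factor-of-H reduction the abstract refers to.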