Bibliographic Details
Main Authors: Xu, Yufei; Meng, Fanxu; Jiang, Fan; Wang, Yuxuan; Zhou, Ruijie; Wang, Zhaohui; Wu, Jiexi; Pan, Zhixin; Tang, Xiaojuan; Pei, Wenjie; Liu, Tongxuan; Yin, Di; Sun, Xing; Zhang, Muhan
Format: Preprint
Published: 2026
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.28458
author Xu, Yufei
Meng, Fanxu
Jiang, Fan
Wang, Yuxuan
Zhou, Ruijie
Wang, Zhaohui
Wu, Jiexi
Pan, Zhixin
Tang, Xiaojuan
Pei, Wenjie
Liu, Tongxuan
Yin, Di
Sun, Xing
Zhang, Muhan
contents Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical key for each query through a lightweight indexer, then computing attention only on the selected subset. While the downstream sparse attention itself scales favorably, the indexer must still scan the entire prefix for every query, introducing a per-layer bottleneck that grows prohibitively with context length. We propose HISA (Hierarchical Indexed Sparse Attention), a plug-and-play replacement for the indexer that rewrites the search path from a flat token scan into a two-stage hierarchical procedure: (1) a block-level coarse filtering stage that scores pooled block representations to discard irrelevant regions, followed by (2) a token-level refinement stage that applies the original indexer exclusively within the retained candidate blocks. HISA preserves the identical token-level top-k sparse pattern consumed by the downstream Sparse MLA operator and requires no additional training. On kernel-level benchmarks, HISA achieves a speedup at 64K context. On Needle-in-a-Haystack and LongBench, we directly replace the indexer in DeepSeek-V3.2 and GLM-5 with our HISA indexer, without any finetuning. HISA closely matches the original DSA in quality, while substantially outperforming block-sparse baselines.
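The two-stage procedure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the plain dot-product score stands in for the DSA indexer, mean pooling is an assumed choice of block representation, and the names `flat_topk`, `hierarchical_topk`, `block_size`, and `num_blocks_kept` are hypothetical.

```python
import numpy as np

def flat_topk(q, keys, k):
    """Flat baseline: score every key against the query, keep the k best."""
    scores = keys @ q                      # stand-in for the indexer score
    k = min(k, len(scores))
    idx = np.argpartition(-scores, k - 1)[:k]
    return np.sort(idx)

def hierarchical_topk(q, keys, k, block_size, num_blocks_kept):
    """Two-stage selection: coarse block filter, then token-level refinement."""
    n = keys.shape[0]
    n_blocks = (n + block_size - 1) // block_size

    # Stage 1: score one pooled vector per block (mean pooling is an assumption).
    block_reps = np.stack([
        keys[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])
    block_scores = block_reps @ q
    m = min(num_blocks_kept, n_blocks)
    kept = np.sort(np.argpartition(-block_scores, m - 1)[:m])

    # Stage 2: apply the token-level score only inside the retained blocks.
    candidates = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n)) for b in kept
    ])
    token_scores = keys[candidates] @ q
    k = min(k, len(candidates))
    top = np.argpartition(-token_scores, k - 1)[:k]
    return np.sort(candidates[top])        # token indices, same format as flat_topk

rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 8))        # 64 cached keys, dimension 8
q = rng.standard_normal(8)
selected = hierarchical_topk(q, keys, k=5, block_size=8, num_blocks_kept=2)
```

In this sketch the second stage scores only `num_blocks_kept * block_size` tokens instead of the whole prefix, and the result is still a list of token indices, so a downstream sparse attention step would consume it unchanged. When every block is retained, the output coincides with the flat scan.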
format Preprint
id arxiv_https___arxiv_org_abs_2603_28458
institution arXiv
publishDate 2026
record_format arxiv
title HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2603.28458