Staff View: IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse :: Library Catalog

Bibliographic Details
Main Authors:	Bai, Yushi; Dong, Qian; Jiang, Ting; Lv, Xin; Du, Zhengxiao; Zeng, Aohan; Tang, Jie; Li, Juanzi
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2603.12201

_version_	1866908882127814656
author	Bai, Yushi; Dong, Qian; Jiang, Ting; Lv, Xin; Du, Zhengxiao; Zeng, Aohan; Tang, Jie; Li, Juanzi
author_facet	Bai, Yushi; Dong, Qian; Jiang, Ting; Lv, Xin; Du, Zhengxiao; Zeng, Aohan; Tang, Jie; Li, Juanzi
contents	Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_12201
institution	arXiv
publishDate	2026
record_format	arxiv
title	IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2603.12201
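
The Full/Shared layer partition described in the contents field can be illustrated with a small sketch. This is not the authors' implementation: the function names, the plain-list score format, and the choice to reuse the nearest preceding Full layer (the abstract only says "nearest") are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Set, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_with_index_cache(
    num_layers: int,
    full_layers: Set[int],
    indexer: Callable[[int], Sequence[float]],  # hypothetical: layer id -> token scores
    k: int,
) -> Tuple[List[List[int]], int]:
    """Compute per-layer top-k indices, running the indexer only on Full layers.

    Assumption: a Shared layer reuses the indices of the nearest preceding
    Full layer, so layer 0 must be Full.
    """
    if 0 not in full_layers:
        raise ValueError("layer 0 must be a Full layer in this sketch")

    cached: List[int] = []
    per_layer: List[List[int]] = []
    indexer_calls = 0
    for layer in range(num_layers):
        if layer in full_layers:
            # Full layer: run its own indexer and refresh the cache.
            cached = top_k_indices(indexer(layer), k)
            indexer_calls += 1
        # Shared layer: fall through and reuse the cached indices unchanged.
        per_layer.append(list(cached))
    return per_layer, indexer_calls
```

With 8 layers and Full layers {0, 4}, the indexer runs twice instead of eight times, which corresponds to the 75% reduction in indexer computations quoted in the abstract; which layers should be Full is exactly what the two proposed approaches decide.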
