MARC21: :: Library Catalog

Bibliographic Details
Main Authors:	Yang, Xu; Zhang, Jiapeng; Zhao, Dongyang; Chen, Guo; Tang, Zhuo
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.14224

_version_	1866911517621878784
author	Yang, Xu; Zhang, Jiapeng; Zhao, Dongyang; Chen, Guo; Tang, Zhuo
author_facet	Yang, Xu; Zhang, Jiapeng; Zhao, Dongyang; Chen, Guo; Tang, Zhuo
contents	The KV cache in self-attention has emerged as a major bottleneck in long-context and large-batch inference for LLMs. Existing approaches often treat sparsity prediction and compression as separate modules, relying on auxiliary index structures to select relevant tokens, and on complex quantization schemes to reduce memory usage. This fragmented design introduces redundant overhead and limits scalability. In this paper, we propose a novel paradigm: treating the compressed key representation not merely as storage, but as a self-indexing structure that directly enables efficient sparse attention. By designing a sign-based 1-bit vector quantization (VQ) scheme, our method unifies compression and retrieval in a single, hardware-friendly format. This approach eliminates the need for external indices or learning-based predictors, offering a lightweight yet robust solution for memory-constrained inference. All components are designed to be hardware-efficient and easy to implement. By implementing custom CUDA kernels, our method integrates seamlessly with FlashAttention, minimizing additional runtime and memory overhead. Experimental results demonstrate that our approach delivers both effectiveness and efficiency.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_14224
institution	arXiv
publishDate	2026
record_format	arxiv
title	Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2603.14224
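
Illustrative note: the abstract describes compressed keys that double as an index for sparse attention. The record does not give the paper's actual quantization scheme, so the following is only a minimal sketch under stated assumptions: each key is reduced to its sign bits plus one per-key scale (the mean absolute value), and the same compressed form is used both to rank tokens and to compute the attention weights. The function names (`sign_bits`, `compress_keys`, `approx_logit`, `sparse_attention`) and the choice of scale are hypothetical and are not taken from the paper.

```python
import math

def sign_bits(vector):
    """Pack sign bits into one integer: bit i is 1 when vector[i] >= 0."""
    code = 0
    for i, x in enumerate(vector):
        if x >= 0:
            code |= 1 << i
    return code

def compress_keys(keys):
    """Return one (code, scale) pair per key.

    Assumption: scale is the mean absolute value of the key, a common
    choice for 1-bit quantization; the paper may use something else.
    """
    return [(sign_bits(k), sum(abs(x) for x in k) / len(k)) for k in keys]

def approx_logit(query, code, scale):
    """Estimate q . k from the compressed key: scale * sum_i q_i * sign(k_i)."""
    total = 0.0
    for i, q in enumerate(query):
        total += q if (code >> i) & 1 else -q
    return scale * total

def sparse_attention(query, compressed, values, top_k):
    """Rank tokens with the compressed keys, then attend to the top_k only."""
    dim = len(query)
    logits = [approx_logit(query, c, s) / math.sqrt(dim) for c, s in compressed]
    # The compressed keys act as the index: no separate structure is consulted.
    selected = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the selected tokens.
    peak = max(logits[i] for i in selected)
    weights = [math.exp(logits[i] - peak) for i in selected]
    norm = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, selected):
        for j, v in enumerate(values[i]):
            out[j] += (w / norm) * v
    return out, sorted(selected)
```

In this sketch the only per-token key storage is one bit per dimension plus a single scale, and token selection reuses that storage instead of an auxiliary index. A production version would operate on packed machine words inside a GPU kernel rather than Python loops.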
