Staff View: Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving :: Library Catalog

Bibliographic Details
Main Authors:	Zhao, Zihan; Lu, Baotong; Lin, Shengjie; Chen, Yizou; Liu, Jing; Zhang, Yanqi; Miao, Ziming; Yang, Ming-Chang; Shen, Haiying; Chen, Qi; Yang, Fan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2604.26837

_version_	1866911633245208576
author	Zhao, Zihan; Lu, Baotong; Lin, Shengjie; Chen, Yizou; Liu, Jing; Zhang, Yanqi; Miao, Ziming; Yang, Ming-Chang; Shen, Haiying; Chen, Qi; Yang, Fan
author_facet	Zhao, Zihan; Lu, Baotong; Lin, Shengjie; Chen, Yizou; Liu, Jing; Zhang, Yanqi; Miao, Ziming; Yang, Ming-Chang; Shen, Haiying; Chen, Qi; Yang, Fan
contents	Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily erase the benefits of sparsity. We present SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three techniques: (1) a unified partition abstraction that maps different sparsity granularities onto a shared page-based KV substrate; (2) a locality-aware KV cache manager that dynamically sizes per-request HBM budgets and uses a GPU-friendly bucketed LRU policy to cut PCIe round-trips; and (3) a two-level hierarchical metadata layout sized to the active working set rather than the worst-case address space. Built on vLLM with three representative sparse attention algorithms, SPIN delivers 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM, and reduces TPOT by up to 58% over the original sparse-attention implementations.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_26837
institution	arXiv
publishDate	2026
record_format	arxiv
title	Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
topic	Machine Learning
url	https://arxiv.org/abs/2604.26837