Bibliographic Details
Main Authors: Chen, Yaoqi; Zhang, Jinkai; Lu, Baotong; Zhang, Qianxi; Zhang, Chengruidong; Liu, Jing; Luo, Jingjia; Liu, Di; Jiang, Huiqiang; Chen, Qi; Ding, Bailu; Yan, Xiao; Jiang, Jiawei; Chen, Chen; Zhang, Mingxing; Li, Cheng; Yang, Yuqing; Yang, Fan; Yang, Mao
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2505.02922
Abstract:
  • Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention's inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces an Attention-aWare VEctor index (wave index), which fundamentally improves the tradeoff between attention accuracy and retrieval cost through tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. We also design the wave buffer, a GPU-CPU buffer manager that assigns computation and manages data across heterogeneous hardware. We evaluate RetroInfer across a range of models and workloads, demonstrating up to 4.4× decoding throughput over full attention at 120K context and up to 12.2× over sparse attention baselines at 1 million tokens, all while preserving full-attention-level accuracy.
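
The sparsity idea the abstract relies on can be illustrated with a minimal sketch. This is not RetroInfer's wave index or wave buffer: it is a plain top-k selection over a toy KV cache, written in pure Python, and every name in it (`full_attention`, `topk_sparse_attention`, the toy sizes) is an assumption made for illustration. It compares attention over all cached tokens with attention over only the k highest-scoring tokens for a single query.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Attention for one query: scans every cached key (linear in context length)."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def topk_sparse_attention(q, keys, values, k):
    """Attend only to the k keys with the highest scores.

    Illustrative assumption: the scores are still computed with a full scan
    here. A real system would use an index to find the important tokens
    without touching every key; that part is not modeled.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, key) * scale for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]


if __name__ == "__main__":
    # Toy cache: 512 tokens, 16-dim heads, with 8 "hot" tokens aligned to the query.
    rng = random.Random(0)
    dim, n_tokens, n_hot = 16, 512, 8
    q = [1.0] * dim
    keys = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]
    values = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]
    for i in rng.sample(range(n_tokens), n_hot):
        keys[i] = [4.0] * dim  # strongly aligned with q, so it dominates the softmax

    full = full_attention(q, keys, values)
    sparse = topk_sparse_attention(q, keys, values, k=n_hot)
    err = max(abs(a - b) for a, b in zip(full, sparse))
    print(f"max abs difference using {n_hot}/{n_tokens} tokens: {err:.2e}")
```

When the attention weights are concentrated on a few tokens, as in this constructed example, the sparse result stays close to the full one while reading only a small fraction of the values. How closely this holds on real models depends on the sparsity pattern, which is the tradeoff the abstract describes.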