:: Library Catalog

Bibliographic Details
Main Authors:	Zhao, Zihan; Lu, Baotong; Lin, Shengjie; Chen, Yizou; Liu, Jing; Zhang, Yanqi; Miao, Ziming; Yang, Ming-Chang; Shen, Haiying; Chen, Qi; Yang, Fan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2604.26837

Similar Items

SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
by: Yang, Shang, et al.
Published: (2025)

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
by: Liu, Di, et al.
Published: (2024)

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
by: Chen, Yaoqi, et al.
Published: (2025)

EconoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
by: Shen, Haiying, et al.
Published: (2024)

Lotus: Optimizing Disaggregated Transactions with Disaggregated Locks
by: Hu, Zhisheng, et al.
Published: (2025)

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
by: Zhu, Qianchao, et al.
Published: (2024)

AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
by: Shen, Haiying, et al.
Published: (2025)

Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving
by: Wang, Chao, et al.
Published: (2025)

HiCI: Hierarchical Construction-Integration for Long-Context Attention
by: Zeng, Xiangyu, et al.
Published: (2026)

ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
by: Qiu, Haoran, et al.
Published: (2025)

MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing
by: Su, Zhaoyuan, et al.
Published: (2025)

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
by: Hu, Cunchen, et al.
Published: (2024)

Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
by: Gao, Shihong, et al.
Published: (2025)

VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
by: Liu, Anmin, et al.
Published: (2026)

Strata: Hierarchical Context Caching for Long Context Language Model Serving
by: Xie, Zhiqiang, et al.
Published: (2025)

L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
by: Liu, Qingyuan, et al.
Published: (2025)

Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
by: Hu, Xiang, et al.
Published: (2025)

Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs
by: Ni, Wentao, et al.
Published: (2026)

The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving
by: Zeng, Pai, et al.
Published: (2024)

KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving
by: Cheng, Rongxin, et al.
Published: (2024)

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
by: Zhou, Ruijie, et al.
Published: (2026)

Long-Context Generalization with Sparse Attention
by: Vasylenko, Pavlo, et al.
Published: (2025)

DEX: Scalable Range Indexing on Disaggregated Memory [Extended Version]
by: Lu, Baotong, et al.
Published: (2024)

AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
by: Liu, Di, et al.
Published: (2026)

C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG
by: Luo, Shutian, et al.
Published: (2026)

VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling
by: Guanzhong, Chen
Published: (2026)

Pancake: Hierarchical Memory System for Multi-Agent LLM Serving
by: Hu, Zhengding, et al.
Published: (2026)

Performance and mildness of alkyl glycoside hydroxypropyl sulfonate
by: Kuan Chang, et al.
Published: (2024)

Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models
by: Shen, Alfred, et al.
Published: (2026)

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
by: Qiu, Shi, et al.
Published: (2026)

Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)

FAST-Prefill: FPGA Accelerated Sparse Attention for Long Context LLM Prefill
by: Jayanth, Rakshith, et al.
Published: (2026)

Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
by: Deshmukh, Dhruv, et al.
Published: (2025)

Veda: Scalable Video Diffusion via Distilled Sparse Attention
by: Han, Shihao, et al.
Published: (2026)

Towards Efficient and Scalable Distributed Vector Search with RDMA
by: Zhi, Xiangyu, et al.
Published: (2025)

HiMem: Hierarchical Long-Term Memory for LLM Long-Horizon Agents
by: Zhang, Ningning, et al.
Published: (2026)

Memory as Asset: From Agent-centric to Human-centric Memory Management
by: Pan, Yanqi, et al.
Published: (2026)

Scaling Long-Horizon LLM Agent via Context-Folding
by: Sun, Weiwei, et al.
Published: (2025)

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
by: Wang, Haoxuan, et al.
Published: (2026)