| Main Authors: | Zhao, Zihan, Lu, Baotong, Lin, Shengjie, Chen, Yizou, Liu, Jing, Zhang, Yanqi, Miao, Ziming, Yang, Ming-Chang, Shen, Haiying, Chen, Qi, Yang, Fan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.26837 |
Similar Items
SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
by: Yang, Shang, et al.
Published: (2025)
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
by: Liu, Di, et al.
Published: (2024)
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
by: Chen, Yaoqi, et al.
Published: (2025)
EconoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
by: Shen, Haiying, et al.
Published: (2024)
Lotus: Optimizing Disaggregated Transactions with Disaggregated Locks
by: Hu, Zhisheng, et al.
Published: (2025)
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
by: Zhu, Qianchao, et al.
Published: (2024)
AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
by: Shen, Haiying, et al.
Published: (2025)
Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving
by: Wang, Chao, et al.
Published: (2025)
HiCI: Hierarchical Construction-Integration for Long-Context Attention
by: Zeng, Xiangyu, et al.
Published: (2026)
ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
by: Qiu, Haoran, et al.
Published: (2025)
MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing
by: Su, Zhaoyuan, et al.
Published: (2025)
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
by: Hu, Cunchen, et al.
Published: (2024)
Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
by: Gao, Shihong, et al.
Published: (2025)
VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
by: Liu, Anmin, et al.
Published: (2026)
Strata: Hierarchical Context Caching for Long Context Language Model Serving
by: Xie, Zhiqiang, et al.
Published: (2025)
L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
by: Liu, Qingyuan, et al.
Published: (2025)
Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
by: Hu, Xiang, et al.
Published: (2025)
Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs
by: Ni, Wentao, et al.
Published: (2026)
The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving
by: Zeng, Pai, et al.
Published: (2024)
KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving
by: Cheng, Rongxin, et al.
Published: (2024)
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
by: Zhou, Ruijie, et al.
Published: (2026)
Long-Context Generalization with Sparse Attention
by: Vasylenko, Pavlo, et al.
Published: (2025)
DEX: Scalable Range Indexing on Disaggregated Memory [Extended Version]
by: Lu, Baotong, et al.
Published: (2024)
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
by: Liu, Di, et al.
Published: (2026)
C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG
by: Luo, Shutian, et al.
Published: (2026)
VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling
by: Guanzhong, Chen
Published: (2026)
Pancake: Hierarchical Memory System for Multi-Agent LLM Serving
by: Hu, Zhengding, et al.
Published: (2026)
Performance and mildness of alkyl glycoside hydroxypropyl sulfonate
by: Kuan Chang, et al.
Published: (2024)
Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models
by: Shen, Alfred, et al.
Published: (2026)
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
by: Qiu, Shi, et al.
Published: (2026)
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)
FAST-Prefill: FPGA Accelerated Sparse Attention for Long Context LLM Prefill
by: Jayanth, Rakshith, et al.
Published: (2026)
Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
by: Deshmukh, Dhruv, et al.
Published: (2025)
Veda: Scalable Video Diffusion via Distilled Sparse Attention
by: Han, Shihao, et al.
Published: (2026)
Towards Efficient and Scalable Distributed Vector Search with RDMA
by: Zhi, Xiangyu, et al.
Published: (2025)
HiMem: Hierarchical Long-Term Memory for LLM Long-Horizon Agents
by: Zhang, Ningning, et al.
Published: (2026)
Memory as Asset: From Agent-centric to Human-centric Memory Management
by: Pan, Yanqi, et al.
Published: (2026)
Scaling Long-Horizon LLM Agent via Context-Folding
by: Sun, Weiwei, et al.
Published: (2025)
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
by: Wang, Haoxuan, et al.
Published: (2026)