| Main Authors: | Li, Weiqing, Jiang, Guochao, Ding, Xiangyong, Tao, Zhangcheng, Hao, Chuzhan, Xu, Chenfeng, Zhang, Yuewei, Wang, Hao |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.03775 |
Similar Items
TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale
by: Yoon, Dongha, et al.
Published: (2025)
ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
by: Wang, Shao, et al.
Published: (2026)
KV Cache Compression for Inference Efficiency in LLMs: A Review
by: Liu, Yanyu, et al.
Published: (2025)
BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
by: He, Yiyuan, et al.
Published: (2025)
CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
by: Nian, Sean, et al.
Published: (2026)
FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
by: Zhao, Bingzhe, et al.
Published: (2025)
PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression
by: Jiang, Bo, et al.
Published: (2025)
FlexKV: Flexible Index Offloading for Memory-Disaggregated Key-Value Store
by: Hu, Zhisheng, et al.
Published: (2025)
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
by: Liu, Zedong, et al.
Published: (2026)
GORGO: Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing
by: Toniolo, Alessio Ricci, et al.
Published: (2026)
CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
by: Yuan, Yitao, et al.
Published: (2025)
PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
by: Patel, Ishan, et al.
Published: (2026)
Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
by: Wang, Qipeng
Published: (2026)
ContiguousKV: Accelerating LLM Prefill with Granularity-Aligned KV Cache Management
by: Zou, Jing, et al.
Published: (2026)
Leyline: KV Cache Directives for Agentic Inference
by: Ma, Bole, et al.
Published: (2026)
KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
by: Jiang, Chaoyi, et al.
Published: (2024)
KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference
by: Zhang, Huawei, et al.
Published: (2025)
RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
by: Zhao, Zhan, et al.
Published: (2026)
PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)
Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture
by: Wu, Yu, et al.
Published: (2025)
FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework
by: Zhu, Jianian, et al.
Published: (2025)
KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
by: Jiang, Bo, et al.
Published: (2025)
TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
by: Bian, Zhuohang, et al.
Published: (2026)
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
by: Lee, Wonbeom, et al.
Published: (2024)
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
by: Cho, Minsik, et al.
Published: (2024)
StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)
Resident KV Claims: A Conformance Contract for Future Reuse under Active KV Pressure
by: Stepanek, Lukas
Published: (2026)
Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management
by: Qianli, Liu, et al.
Published: (2025)
PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving
by: Woo, Sunghyeon, et al.
Published: (2026)
Efficient Remote KV Cache Reuse with GPU-native Video Codec
by: Mi, Liang, et al.
Published: (2026)
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)
Hyperion: Low-Latency Ultra-HD Video Analytics via Collaborative Vision Transformer Inference
by: Jiang, Linyi, et al.
Published: (2025)
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
by: Hu, Cunchen, et al.
Published: (2024)
DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
by: Jeong, Bodon, et al.
Published: (2026)
Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
by: Kim, Kihyun, et al.
Published: (2025)
A Survey on Large Language Model Acceleration based on KV Cache Management
by: Li, Haoyang, et al.
Published: (2024)
Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service
by: Zheng, Xianzhe, et al.
Published: (2026)
FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
by: Bin, Kyungmin, et al.
Published: (2025)
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
by: Tu, Dezhan, et al.
Published: (2024)
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
by: Guo, Yipin, et al.
Published: (2026)