:: Library Catalog

Bibliographic Details
Main Authors:	Li, Weiqing, Jiang, Guochao, Ding, Xiangyong, Tao, Zhangcheng, Hao, Chuzhan, Xu, Chenfeng, Zhang, Yuewei, Wang, Hao
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2504.03775

Similar Items

TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale
by: Yoon, Dongha, et al.
Published: (2025)

ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
by: Wang, Shao, et al.
Published: (2026)

KV Cache Compression for Inference Efficiency in LLMs: A Review
by: Liu, Yanyu, et al.
Published: (2025)

BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
by: He, Yiyuan, et al.
Published: (2025)

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
by: Nian, Sean, et al.
Published: (2026)

FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
by: Zhao, Bingzhe, et al.
Published: (2025)

PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression
by: Jiang, Bo, et al.
Published: (2025)

FlexKV: Flexible Index Offloading for Memory-Disaggregated Key-Value Store
by: Hu, Zhisheng, et al.
Published: (2025)

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
by: Liu, Zedong, et al.
Published: (2026)

GORGO: Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing
by: Toniolo, Alessio Ricci, et al.
Published: (2026)

CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
by: Yuan, Yitao, et al.
Published: (2025)

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
by: Patel, Ishan, et al.
Published: (2026)

Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference
by: Wang, Qipeng
Published: (2026)

ContiguousKV: Accelerating LLM Prefill with Granularity-Aligned KV Cache Management
by: Zou, Jing, et al.
Published: (2026)

Leyline: KV Cache Directives for Agentic Inference
by: Ma, Bole, et al.
Published: (2026)

KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
by: Jiang, Chaoyi, et al.
Published: (2024)

KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference
by: Zhang, Huawei, et al.
Published: (2025)

RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
by: Zhao, Zhan, et al.
Published: (2026)

PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)

Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture
by: Wu, Yu, et al.
Published: (2025)

FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework
by: Zhu, Jianian, et al.
Published: (2025)

KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
by: Jiang, Bo, et al.
Published: (2025)

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
by: Bian, Zhuohang, et al.
Published: (2026)

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
by: Lee, Wonbeom, et al.
Published: (2024)

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
by: Cho, Minsik, et al.
Published: (2024)

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)

Resident KV Claims: A Conformance Contract for Future Reuse under Active KV Pressure
by: Stepanek, Lukas
Published: (2026)

Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management
by: Qianli, Liu, et al.
Published: (2025)

PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving
by: Woo, Sunghyeon, et al.
Published: (2026)

Efficient Remote KV Cache Reuse with GPU-native Video Codec
by: Mi, Liang, et al.
Published: (2026)

Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)

Hyperion: Low-Latency Ultra-HD Video Analytics via Collaborative Vision Transformer Inference
by: Jiang, Linyi, et al.
Published: (2025)

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
by: Hu, Cunchen, et al.
Published: (2024)

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
by: Jeong, Bodon, et al.
Published: (2026)

Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
by: Kim, Kihyun, et al.
Published: (2025)

A Survey on Large Language Model Acceleration based on KV Cache Management
by: Li, Haoyang, et al.
Published: (2024)

Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service
by: Zheng, Xianzhe, et al.
Published: (2026)

FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
by: Bin, Kyungmin, et al.
Published: (2025)

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
by: Tu, Dezhan, et al.
Published: (2024)

SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
by: Guo, Yipin, et al.
Published: (2026)