:: Library Catalog

Bibliographic Details
Main Authors:	Zheng, Haoyu; Fu, Fangcheng; Wu, Jia; Yuan, Binhang; Zhang, Yongqiang; Wang, Hao; Zhu, Yuanyuan; Yan, Xiao; Jiang, Jiawei
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.06472

Similar Items

Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
by: Jiang, Youhe, et al.
Published: (2026)

Cascadia: An Efficient Cascade Serving System for Large Language Models
by: Jiang, Youhe, et al.
Published: (2025)

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
by: Peng, You, et al.
Published: (2026)

Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
by: Jiang, Youhe, et al.
Published: (2025)

HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware
by: Yan, Ran, et al.
Published: (2024)

LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management
by: Xiong, Yi, et al.
Published: (2024)

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)

BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization
by: Jiang, Youhe, et al.
Published: (2026)

EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
by: Feng, Shaoting, et al.
Published: (2025)

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
by: Zhong, Zhiqing, et al.
Published: (2026)

EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
by: Guo, Tianyu, et al.
Published: (2025)

Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately
by: Wang, Yuhang, et al.
Published: (2025)

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
by: Nian, Sean, et al.
Published: (2026)

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
by: Qiu, Shi, et al.
Published: (2026)

ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
by: Xiang, Xingyu, et al.
Published: (2025)

CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)

MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches
by: Wang, Xin, et al.
Published: (2026)

Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management
by: Qianli, Liu, et al.
Published: (2025)

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
by: Liu, Zedong, et al.
Published: (2026)

Pythia: Exploiting Workflow Predictability for Efficient Agent-Native LLM Serving
by: Yu, Shan, et al.
Published: (2026)

TridentServe: A Stage-level Serving System for Diffusion Pipelines
by: Xia, Yifei, et al.
Published: (2025)

CriticalKV: Optimizing KV Cache Eviction from an Output Perturbation Perspective
by: Feng, Yuan, et al.
Published: (2025)

LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
by: Xu, Dongjie, et al.
Published: (2026)

CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
by: Liu, Yuhan, et al.
Published: (2023)

Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
by: Kim, Kihyun, et al.
Published: (2025)

ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
by: Wang, Shao, et al.
Published: (2026)

Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)

FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework
by: Zhu, Jianian, et al.
Published: (2025)

TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications
by: Bian, Zhuohang, et al.
Published: (2025)

Taming the Fragility of KV Cache Eviction in LLM Inference
by: Feng, Yuan, et al.
Published: (2025)

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
by: Cai, Zefan, et al.
Published: (2024)

BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
by: He, Yiyuan, et al.
Published: (2025)

BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
by: He, Yiyuan, et al.
Published: (2026)

Efficient Multi-round LLM Inference over Disaggregated Serving
by: He, Wenhao, et al.
Published: (2026)

SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving
by: Zhang, Quqing, et al.
Published: (2026)

Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
by: Kim, Minsu, et al.
Published: (2025)

Joint Encoding of KV-Cache Blocks for Scalable LLM Serving
by: Kampeas, Joseph, et al.
Published: (2026)

ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing
by: Chen, Kaiwen, et al.
Published: (2025)

InstCache: A Predictive Cache for LLM Serving
by: Zou, Longwei, et al.
Published: (2024)

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
by: Bian, Zhuohang, et al.
Published: (2026)