| Main Authors: | Zheng, Haoyu, Fu, Fangcheng, Wu, Jia, Yuan, Binhang, Zhang, Yongqiang, Wang, Hao, Zhu, Yuanyuan, Yan, Xiao, Jiang, Jiawei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.06472 |
Similar Items
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
by: Jiang, Youhe, et al.
Published: (2026)
Cascadia: An Efficient Cascade Serving System for Large Language Models
by: Jiang, Youhe, et al.
Published: (2025)
HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling
by: Peng, You, et al.
Published: (2026)
Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
by: Jiang, Youhe, et al.
Published: (2025)
HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware
by: Yan, Ran, et al.
Published: (2024)
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management
by: Xiong, Yi, et al.
Published: (2024)
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)
BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization
by: Jiang, Youhe, et al.
Published: (2026)
EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
by: Feng, Shaoting, et al.
Published: (2025)
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
by: Zhong, Zhiqing, et al.
Published: (2026)
EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
by: Guo, Tianyu, et al.
Published: (2025)
Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately
by: Wang, Yuhang, et al.
Published: (2025)
CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
by: Nian, Sean, et al.
Published: (2026)
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
by: Qiu, Shi, et al.
Published: (2026)
ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
by: Xiang, Xingyu, et al.
Published: (2025)
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)
MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches
by: Wang, Xin, et al.
Published: (2026)
Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management
by: Qianli, Liu, et al.
Published: (2025)
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
by: Liu, Zedong, et al.
Published: (2026)
Pythia: Exploiting Workflow Predictability for Efficient Agent-Native LLM Serving
by: Yu, Shan, et al.
Published: (2026)
TridentServe: A Stage-level Serving System for Diffusion Pipelines
by: Xia, Yifei, et al.
Published: (2025)
CriticalKV: Optimizing KV Cache Eviction from an Output Perturbation Perspective
by: Feng, Yuan, et al.
Published: (2025)
LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
by: Xu, Dongjie, et al.
Published: (2026)
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
by: Liu, Yuhan, et al.
Published: (2023)
Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
by: Kim, Kihyun, et al.
Published: (2025)
ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
by: Wang, Shao, et al.
Published: (2026)
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)
FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework
by: Zhu, Jianian, et al.
Published: (2025)
TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications
by: Bian, Zhuohang, et al.
Published: (2025)
Taming the Fragility of KV Cache Eviction in LLM Inference
by: Feng, Yuan, et al.
Published: (2025)
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
by: Cai, Zefan, et al.
Published: (2024)
BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
by: He, Yiyuan, et al.
Published: (2025)
BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
by: Yiyuan He, et al.
Published: (2026)
Efficient Multi-round LLM Inference over Disaggregated Serving
by: He, Wenhao, et al.
Published: (2026)
SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving
by: Zhang, Quqing, et al.
Published: (2026)
Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
by: Kim, Minsu, et al.
Published: (2025)
Joint Encoding of KV-Cache Blocks for Scalable LLM Serving
by: Kampeas, Joseph, et al.
Published: (2026)
ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing
by: Chen, Kaiwen, et al.
Published: (2025)
InstCache: A Predictive Cache for LLM Serving
by: Zou, Longwei, et al.
Published: (2024)
TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
by: Bian, Zhuohang, et al.
Published: (2026)