| Main Authors: | Zhang, Huawei, Xia, Chunwei, Wang, Zheng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.11907 |
Similar Items
DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
by: Jeong, Bodon, et al.
Published: (2026)
FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
by: Zhao, Bingzhe, et al.
Published: (2025)
Leyline: KV Cache Directives for Agentic Inference
by: Ma, Bole, et al.
Published: (2026)
FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling
by: Li, Weiqing, et al.
Published: (2025)
PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression
by: Jiang, Bo, et al.
Published: (2025)
ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs
by: Lei, Jianlong, et al.
Published: (2026)
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
by: Cho, Minsik, et al.
Published: (2024)
KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
by: Jiang, Bo, et al.
Published: (2025)
A Survey on Large Language Model Acceleration based on KV Cache Management
by: Li, Haoyang, et al.
Published: (2024)
PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)
DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones
by: Wang, Tuowei, et al.
Published: (2025)
ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
by: Xiang, Xingyu, et al.
Published: (2025)
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)
MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
by: Rhee, Myunghyun, et al.
Published: (2025)
SparOA: Sparse and Operator-aware Hybrid Scheduling for Edge DNN Inference
by: Zhang, Ziyang, et al.
Published: (2025)
KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
by: Wang, Jiahao, et al.
Published: (2025)
KV Cache Compression for Inference Efficiency in LLMs: A Review
by: Liu, Yanyu, et al.
Published: (2025)
PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
by: Liu, Yangyijian, et al.
Published: (2025)
Inference Offloading for Cost-Sensitive Binary Classification at the Edge
by: Moothedath, Vishnu Narayanan, et al.
Published: (2025)
A Model Aware AIGC Task Offloading Algorithm in IIoT Edge Computing
by: Wang, Xin, et al.
Published: (2025)
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
by: Liu, Zedong, et al.
Published: (2026)
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)
Joint Resource Optimization, Computation Offloading and Resource Slicing for Multi-Edge Traffic-Cognitive Networks
by: Xiaoyang, Ting, et al.
Published: (2024)
TCM-Serve: Modality-aware Scheduling for Multimodal Large Language Model Inference
by: Papaioannou, Konstantinos, et al.
Published: (2026)
MatKV: Trading Compute for Flash Storage in LLM Inference
by: Shin, Kun-Woo, et al.
Published: (2025)
Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption
by: Yildiz, Mert, et al.
Published: (2026)
Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
by: Meng, William, et al.
Published: (2025)
KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
by: Yuan, Yichao, et al.
Published: (2026)
Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA
by: Li, Allison, et al.
Published: (2025)
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
by: Jiang, Xuanlin, et al.
Published: (2024)
Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
by: Kim, Kihyun, et al.
Published: (2025)
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
by: Tu, Dezhan, et al.
Published: (2024)
InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training
by: Wang, Shiju, et al.
Published: (2025)
InstGenIE: Generative Image Editing Made Efficient with Mask-aware Caching and Scheduling
by: Jiang, Xiaoxiao, et al.
Published: (2025)
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
by: Zhou, Zhongzhu, et al.
Published: (2026)
Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
by: Deshmukh, Dhruv, et al.
Published: (2025)
Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference
by: Zhu, Yue, et al.
Published: (2025)
TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
by: Xie, Jincheng, et al.
Published: (2026)
TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
by: Pan, Yudong, et al.
Published: (2026)