Saved in:
| Main Authors: | Zhang, Qiuyang, Zhou, Kai, Tang, Ding, Lu, Kai, Li, Cheng, Yang, Zhenyu, Xu, Peng, Wan, Jiguang |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.27138 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
by: Ma, Da, et al.
Published: (2024)
by: Ma, Da, et al.
Published: (2024)
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
by: Tao, Wei, et al.
Published: (2026)
by: Tao, Wei, et al.
Published: (2026)
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)
by: Liu, Guangda, et al.
Published: (2025)
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
by: Liu, Yuhan, et al.
Published: (2025)
by: Liu, Yuhan, et al.
Published: (2025)
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
by: Sharma, Akshat, et al.
Published: (2024)
by: Sharma, Akshat, et al.
Published: (2024)
Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
by: Kim, Kihyun, et al.
Published: (2025)
by: Kim, Kihyun, et al.
Published: (2025)
Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
by: Meng, William, et al.
Published: (2025)
by: Meng, William, et al.
Published: (2025)
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
by: Dong, Harry, et al.
Published: (2024)
by: Dong, Harry, et al.
Published: (2024)
KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)
by: Tian, Yuxuan, et al.
Published: (2025)
DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
by: Jeong, Bodon, et al.
Published: (2026)
by: Jeong, Bodon, et al.
Published: (2026)
Hammer: Towards Efficient Hot-Cold Data Identification via Online Learning
by: Lu, Kai, et al.
Published: (2024)
by: Lu, Kai, et al.
Published: (2024)
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
by: Dehghanighobadi, Zahra, et al.
Published: (2026)
by: Dehghanighobadi, Zahra, et al.
Published: (2026)
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
by: Liu, Xiang, et al.
Published: (2025)
by: Liu, Xiang, et al.
Published: (2025)
SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
by: Zhao, Yi, et al.
Published: (2025)
by: Zhao, Yi, et al.
Published: (2025)
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
by: Feng, Yuan, et al.
Published: (2024)
by: Feng, Yuan, et al.
Published: (2024)
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
by: Jiang, Xuanlin, et al.
Published: (2024)
by: Jiang, Xuanlin, et al.
Published: (2024)
Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
by: Li, Hanchen, et al.
Published: (2025)
by: Li, Hanchen, et al.
Published: (2025)
KV Cache Offloading for Context-Intensive Tasks
by: Bocharnikov, Andrey, et al.
Published: (2026)
by: Bocharnikov, Andrey, et al.
Published: (2026)
Efficient Long-Context LLM Inference via KV Cache Clustering
by: Hu, Jie, et al.
Published: (2025)
by: Hu, Jie, et al.
Published: (2025)
KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference
by: Zhang, Huawei, et al.
Published: (2025)
by: Zhang, Huawei, et al.
Published: (2025)
SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving
by: Zhang, Quqing, et al.
Published: (2026)
by: Zhang, Quqing, et al.
Published: (2026)
WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
by: Zuo, Youhui, et al.
Published: (2025)
by: Zuo, Youhui, et al.
Published: (2025)
KV Cache Optimization Strategies for Scalable and Efficient LLM Inference
by: Xu, Yichun, et al.
Published: (2026)
by: Xu, Yichun, et al.
Published: (2026)
Layer-Condensed KV Cache for Efficient Inference of Large Language Models
by: Wu, Haoyi, et al.
Published: (2024)
by: Wu, Haoyi, et al.
Published: (2024)
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
by: Yao, Jiayi, et al.
Published: (2026)
by: Yao, Jiayi, et al.
Published: (2026)
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
by: Liu, Hongyao, et al.
Published: (2026)
by: Liu, Hongyao, et al.
Published: (2026)
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)
by: Zhu, Yuxuan, et al.
Published: (2025)
Taming the Fragility of KV Cache Eviction in LLM Inference
by: Feng, Yuan, et al.
Published: (2025)
by: Feng, Yuan, et al.
Published: (2025)
Online Scheduling for LLM Inference with KV Cache Constraints
by: Jaillet, Patrick, et al.
Published: (2025)
by: Jaillet, Patrick, et al.
Published: (2025)
EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)
by: Yu, Bohan, et al.
Published: (2025)
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
by: Wang, Zihao, et al.
Published: (2024)
by: Wang, Zihao, et al.
Published: (2024)
G-KV: Decoding-Time KV Cache Eviction with Global Attention
by: Liao, Mengqi, et al.
Published: (2025)
by: Liao, Mengqi, et al.
Published: (2025)
MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
by: Zeng, Wenxuan, et al.
Published: (2025)
by: Zeng, Wenxuan, et al.
Published: (2025)
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
by: Huang, Kai, et al.
Published: (2025)
by: Huang, Kai, et al.
Published: (2025)
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
by: Yang, Yifei, et al.
Published: (2024)
by: Yang, Yifei, et al.
Published: (2024)
CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference
by: Wu, Guanlong, et al.
Published: (2026)
by: Wu, Guanlong, et al.
Published: (2026)
Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference
by: Le, Hoang Anh Duy, et al.
Published: (2026)
by: Le, Hoang Anh Duy, et al.
Published: (2026)
AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
by: Gu, Yifeng, et al.
Published: (2025)
by: Gu, Yifeng, et al.
Published: (2025)
xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
by: Chang, Chi-Chih, et al.
Published: (2025)
by: Chang, Chi-Chih, et al.
Published: (2025)
MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
by: Tao, Wei, et al.
Published: (2025)
by: Tao, Wei, et al.
Published: (2025)
Similar Items
-
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
by: Ma, Da, et al.
Published: (2024) -
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
by: Tao, Wei, et al.
Published: (2026) -
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025) -
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
by: Liu, Yuhan, et al.
Published: (2025) -
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
by: Sharma, Akshat, et al.
Published: (2024)