:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Qiuyang; Zhou, Kai; Tang, Ding; Lu, Kai; Li, Cheng; Yang, Zhenyu; Xu, Peng; Wan, Jiguang
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2603.27138

Similar Items

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
by: Ma, Da, et al.
Published: (2024)

WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
by: Tao, Wei, et al.
Published: (2026)

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
by: Liu, Yuhan, et al.
Published: (2025)

MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
by: Sharma, Akshat, et al.
Published: (2024)

Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
by: Kim, Kihyun, et al.
Published: (2025)

Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
by: Meng, William, et al.
Published: (2025)

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
by: Dong, Harry, et al.
Published: (2024)

KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
by: Jeong, Bodon, et al.
Published: (2026)

Hammer: Towards Efficient Hot-Cold Data Identification via Online Learning
by: Lu, Kai, et al.
Published: (2024)

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
by: Dehghanighobadi, Zahra, et al.
Published: (2026)

ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
by: Liu, Xiang, et al.
Published: (2025)

SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
by: Zhao, Yi, et al.
Published: (2025)

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
by: Feng, Yuan, et al.
Published: (2024)

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
by: Jiang, Xuanlin, et al.
Published: (2024)

Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
by: Li, Hanchen, et al.
Published: (2025)

KV Cache Offloading for Context-Intensive Tasks
by: Bocharnikov, Andrey, et al.
Published: (2026)

Efficient Long-Context LLM Inference via KV Cache Clustering
by: Hu, Jie, et al.
Published: (2025)

KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference
by: Zhang, Huawei, et al.
Published: (2025)

SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving
by: Zhang, Quqing, et al.
Published: (2026)

WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
by: Zuo, Youhui, et al.
Published: (2025)

KV Cache Optimization Strategies for Scalable and Efficient LLM Inference
by: Xu, Yichun, et al.
Published: (2026)

Layer-Condensed KV Cache for Efficient Inference of Large Language Models
by: Wu, Haoyi, et al.
Published: (2024)

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
by: Yao, Jiayi, et al.
Published: (2026)

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
by: Liu, Hongyao, et al.
Published: (2026)

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)

Taming the Fragility of KV Cache Eviction in LLM Inference
by: Feng, Yuan, et al.
Published: (2025)

Online Scheduling for LLM Inference with KV Cache Constraints
by: Jaillet, Patrick, et al.
Published: (2025)

EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
by: Wang, Zihao, et al.
Published: (2024)

G-KV: Decoding-Time KV Cache Eviction with Global Attention
by: Liao, Mengqi, et al.
Published: (2025)

MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
by: Zeng, Wenxuan, et al.
Published: (2025)

AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
by: Huang, Kai, et al.
Published: (2025)

KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
by: Yang, Yifei, et al.
Published: (2024)

CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference
by: Wu, Guanlong, et al.
Published: (2026)

Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference
by: Le, Hoang Anh Duy, et al.
Published: (2026)

AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
by: Gu, Yifeng, et al.
Published: (2025)

xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
by: Chang, Chi-Chih, et al.
Published: (2025)

MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
by: Tao, Wei, et al.
Published: (2025)