| Main Authors: | Shen, Haiying, Sen, Tanmoy, Tanaka, Masahiro |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2503.13773 |
Similar Items
AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
by: Shen, Haiying, et al.
Published: (2025)
EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)
Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference
by: Luo, Zhifan, et al.
Published: (2025)
Taming the Fragility of KV Cache Eviction in LLM Inference
by: Feng, Yuan, et al.
Published: (2025)
EconoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
by: Shen, Haiying, et al.
Published: (2024)
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
by: Liu, Xiang, et al.
Published: (2025)
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
by: Guo, Jinyu, et al.
Published: (2026)
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
by: Sun, Hanshi, et al.
Published: (2024)
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
by: Dehghanighobadi, Zahra, et al.
Published: (2026)
Efficient Long-Context LLM Inference via KV Cache Clustering
by: Hu, Jie, et al.
Published: (2025)
WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
by: Zuo, Youhui, et al.
Published: (2025)
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
by: Feng, Yuan, et al.
Published: (2024)
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)
KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)
KV Cache Transform Coding for Compact Storage in LLM Inference
by: Staniszewski, Konrad, et al.
Published: (2025)
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference
by: Yang, Dongjie, et al.
Published: (2024)
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
by: Mai, Tho, et al.
Published: (2026)
TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
by: Yao, Dingyu, et al.
Published: (2025)
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
by: Behnam, Payman, et al.
Published: (2025)
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
by: Sharma, Akshat, et al.
Published: (2024)
Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
by: Lin, Hongzhan, et al.
Published: (2025)
PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
by: Patel, Ishan, et al.
Published: (2026)
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
by: Wan, Zhongwei, et al.
Published: (2025)
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
by: Ma, Da, et al.
Published: (2024)
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
by: Dzikanyanga, Gradwell, et al.
Published: (2026)
XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
by: Li, Weizhuo, et al.
Published: (2024)
FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)
HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
by: Shi, Zhiyuan, et al.
Published: (2026)
OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference
by: Gu, Yuzhe, et al.
Published: (2025)
Inference-Time Hyper-Scaling with KV Cache Compression
by: Łańcucki, Adrian, et al.
Published: (2025)
QAQ: Quality Adaptive Quantization for LLM KV Cache
by: Dong, Shichen, et al.
Published: (2024)
MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
by: Li, Kunxi, et al.
Published: (2025)
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
by: Cho, Minsik, et al.
Published: (2024)
KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference
by: Lin, Jian, et al.
Published: (2026)
BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference
by: Zhao, Junqi, et al.
Published: (2024)
IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference
by: Yang, Xintong, et al.
Published: (2026)
Layer-Condensed KV Cache for Efficient Inference of Large Language Models
by: Wu, Haoyi, et al.
Published: (2024)
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
by: Kang, Hao, et al.
Published: (2024)
VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
by: Yao, Dingyu, et al.
Published: (2025)