:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Shen, Haiying, Sen, Tanmoy, Tanaka, Masahiro
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2503.13773
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
by: Shen, Haiying, et al.
Published: (2025)

EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)

Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference
by: Luo, Zhifan, et al.
Published: (2025)

Taming the Fragility of KV Cache Eviction in LLM Inference
by: Feng, Yuan, et al.
Published: (2025)

EconoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving
by: Shen, Haiying, et al.
Published: (2024)

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)

ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
by: Liu, Xiang, et al.
Published: (2025)

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
by: Guo, Jinyu, et al.
Published: (2026)

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
by: Sun, Hanshi, et al.
Published: (2024)

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
by: Dehghanighobadi, Zahra, et al.
Published: (2026)

Efficient Long-Context LLM Inference via KV Cache Clustering
by: Hu, Jie, et al.
Published: (2025)

WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
by: Zuo, Youhui, et al.
Published: (2025)

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
by: Feng, Yuan, et al.
Published: (2024)

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)

KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)

KV Cache Transform Coding for Compact Storage in LLM Inference
by: Staniszewski, Konrad, et al.
Published: (2025)

PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference
by: Yang, Dongjie, et al.
Published: (2024)

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
by: Mai, Tho, et al.
Published: (2026)

TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
by: Yao, Dingyu, et al.
Published: (2025)

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
by: Behnam, Payman, et al.
Published: (2025)

MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
by: Sharma, Akshat, et al.
Published: (2024)

Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
by: Lin, Hongzhan, et al.
Published: (2025)

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
by: Patel, Ishan, et al.
Published: (2026)

MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
by: Wan, Zhongwei, et al.
Published: (2025)

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
by: Ma, Da, et al.
Published: (2024)

TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
by: Dzikanyanga, Gradwell, et al.
Published: (2026)

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
by: Li, Weizhuo, et al.
Published: (2024)

FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)

HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
by: Shi, Zhiyuan, et al.
Published: (2026)

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference
by: Gu, Yuzhe, et al.
Published: (2025)

Inference-Time Hyper-Scaling with KV Cache Compression
by: Łańcucki, Adrian, et al.
Published: (2025)

QAQ: Quality Adaptive Quantization for LLM KV Cache
by: Dong, Shichen, et al.
Published: (2024)

MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
by: Li, Kunxi, et al.
Published: (2025)

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
by: Cho, Minsik, et al.
Published: (2024)

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference
by: Lin, Jian, et al.
Published: (2026)

BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference
by: Zhao, Junqi, et al.
Published: (2024)

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference
by: Yang, Xintong, et al.
Published: (2026)

Layer-Condensed KV Cache for Efficient Inference of Large Language Models
by: Wu, Haoyi, et al.
Published: (2024)

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
by: Kang, Hao, et al.
Published: (2024)

VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
by: Yao, Dingyu, et al.
Published: (2025)