| Main Authors: | Wang, Zihao, Cui, Bin, Gan, Shaoduo |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2404.04793 |
Similar Items
LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction
by: Zhou, Enshuai, et al.
Published: (2026)
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
by: Sharma, Akshat, et al.
Published: (2024)
In-context KV-Cache Eviction for LLMs via Attention-Gate
by: Zeng, Zihao, et al.
Published: (2024)
KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)
EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
by: Behnam, Payman, et al.
Published: (2025)
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
by: Song, Dinghong, et al.
Published: (2025)
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
by: Sun, Hanshi, et al.
Published: (2024)
AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
by: Yang, Qingyue, et al.
Published: (2025)
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
by: Yang, Yifei, et al.
Published: (2024)
XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
by: Li, Weizhuo, et al.
Published: (2024)
KV Cache Transform Coding for Compact Storage in LLM Inference
by: Staniszewski, Konrad, et al.
Published: (2025)
xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
by: Chang, Chi-Chih, et al.
Published: (2025)
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
by: Dzikanyanga, Gradwell, et al.
Published: (2026)
Sparse Attention across Multiple-context KV Cache
by: Cao, Ziyi, et al.
Published: (2025)
Inference-Time Hyper-Scaling with KV Cache Compression
by: Łańcucki, Adrian, et al.
Published: (2025)
SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
by: S, Santhosh G, et al.
Published: (2025)
PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
by: Patel, Ishan, et al.
Published: (2026)
LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
by: Shen, Yiqun, et al.
Published: (2025)
KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
by: Li, Xing, et al.
Published: (2025)
Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle
by: Wang, Zihan, et al.
Published: (2026)
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
by: Tang, Hanlin, et al.
Published: (2024)
Beyond KV Caching: Shared Attention for Efficient LLMs
by: Liao, Bingli, et al.
Published: (2024)
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
by: Kang, Hao, et al.
Published: (2024)
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)
CAOTE: KV Cache Selection for LLMs via Attention Output Error-Based Token Eviction
by: Goel, Raghavv, et al.
Published: (2025)
HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
by: Yang, Dongquan, et al.
Published: (2025)
LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
by: Wang, Guangtao, et al.
Published: (2025)
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management
by: Xiong, Yi, et al.
Published: (2024)
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
by: Ye, Lu, et al.
Published: (2024)
MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
by: Li, Kunxi, et al.
Published: (2025)
Attention Is All You Need for KV Cache in Diffusion LLMs
by: Nguyen-Tri, Quan, et al.
Published: (2025)
On the Effect of Uncertainty on Layer-wise Inference Dynamics
by: Kim, Sunwoo, et al.
Published: (2025)
MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression
by: Sun, Libo, et al.
Published: (2026)
Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization
by: Song, Guanghui, et al.
Published: (2025)
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
by: Nadali, Alireza, et al.
Published: (2026)
RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
by: Zuo, Fei, et al.
Published: (2026)
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
by: Bai, Yushi, et al.
Published: (2026)