Saved in:
| Main Authors: | Łańcucki, Adrian, Staniszewski, Konrad, Nawrot, Piotr, Ponti, Edoardo M. |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.05345 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
KV Cache Transform Coding for Compact Storage in LLM Inference
by: Staniszewski, Konrad, et al.
Published: (2025)
by: Staniszewski, Konrad, et al.
Published: (2025)
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
by: Nawrot, Piotr, et al.
Published: (2024)
by: Nawrot, Piotr, et al.
Published: (2024)
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
by: Nawrot, Piotr, et al.
Published: (2025)
by: Nawrot, Piotr, et al.
Published: (2025)
EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)
by: Yu, Bohan, et al.
Published: (2025)
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
by: Behnam, Payman, et al.
Published: (2025)
by: Behnam, Payman, et al.
Published: (2025)
KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
by: Tian, Yuxuan, et al.
Published: (2025)
by: Tian, Yuxuan, et al.
Published: (2025)
KVSculpt: KV Cache Compression as Distillation
by: Jiang, Bo, et al.
Published: (2026)
by: Jiang, Bo, et al.
Published: (2026)
PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
by: Patel, Ishan, et al.
Published: (2026)
by: Patel, Ishan, et al.
Published: (2026)
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration
by: Jo, Dongwon, et al.
Published: (2025)
by: Jo, Dongwon, et al.
Published: (2025)
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
by: Kang, Hao, et al.
Published: (2024)
by: Kang, Hao, et al.
Published: (2024)
AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
by: Yang, Qingyue, et al.
Published: (2025)
by: Yang, Qingyue, et al.
Published: (2025)
LongFlow: Efficient KV Cache Compression for Reasoning Models
by: Su, Yi, et al.
Published: (2026)
by: Su, Yi, et al.
Published: (2026)
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
by: Sun, Hanshi, et al.
Published: (2024)
by: Sun, Hanshi, et al.
Published: (2024)
xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
by: Chang, Chi-Chih, et al.
Published: (2025)
by: Chang, Chi-Chih, et al.
Published: (2025)
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
by: Liu, Guangda, et al.
Published: (2025)
by: Liu, Guangda, et al.
Published: (2025)
OjaKV: Context-Aware Online Low-Rank KV Cache Compression
by: Zhu, Yuxuan, et al.
Published: (2025)
by: Zhu, Yuxuan, et al.
Published: (2025)
Quantization Dominates Rank Reduction for KV-Cache Compression
by: Salfati, Samuel
Published: (2026)
by: Salfati, Samuel
Published: (2026)
Time and Memory Trade-off of KV-Cache Compression in Tensor Transformer Decoding
by: Chen, Yifang, et al.
Published: (2025)
by: Chen, Yifang, et al.
Published: (2025)
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
by: Liu, Akide, et al.
Published: (2024)
by: Liu, Akide, et al.
Published: (2024)
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
by: Tang, Hanlin, et al.
Published: (2024)
by: Tang, Hanlin, et al.
Published: (2024)
ManifoldKV: Training-Free KV Cache Compression via Euclidean Outlier Detection
by: Datta, Debajyoti, et al.
Published: (2026)
by: Datta, Debajyoti, et al.
Published: (2026)
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
by: Sharma, Akshat, et al.
Published: (2024)
by: Sharma, Akshat, et al.
Published: (2024)
SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
by: S, Santhosh G, et al.
Published: (2025)
by: S, Santhosh G, et al.
Published: (2025)
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
by: Zhu, Yuxuan, et al.
Published: (2025)
by: Zhu, Yuxuan, et al.
Published: (2025)
XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
by: Li, Weizhuo, et al.
Published: (2024)
by: Li, Weizhuo, et al.
Published: (2024)
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)
by: Saxena, Utkarsh, et al.
Published: (2024)
Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding
by: Srinivas, Sakhinana Sagar, et al.
Published: (2025)
by: Srinivas, Sakhinana Sagar, et al.
Published: (2025)
MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
by: Li, Kunxi, et al.
Published: (2025)
by: Li, Kunxi, et al.
Published: (2025)
IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
by: Ji, Zhongping
Published: (2026)
by: Ji, Zhongping
Published: (2026)
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
by: Dzikanyanga, Gradwell, et al.
Published: (2026)
by: Dzikanyanga, Gradwell, et al.
Published: (2026)
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
by: Nadali, Alireza, et al.
Published: (2026)
by: Nadali, Alireza, et al.
Published: (2026)
HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
by: Yang, Dongquan, et al.
Published: (2025)
by: Yang, Dongquan, et al.
Published: (2025)
EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection
by: Zhou, Yuhao, et al.
Published: (2025)
by: Zhou, Yuhao, et al.
Published: (2025)
Scaling Sparse Fine-Tuning to Large Language Models
by: Ansell, Alan, et al.
Published: (2024)
by: Ansell, Alan, et al.
Published: (2024)
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
by: Yang, Yifei, et al.
Published: (2024)
by: Yang, Yifei, et al.
Published: (2024)
One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache
by: Lu, Liming, et al.
Published: (2026)
by: Lu, Liming, et al.
Published: (2026)
Probing the Emergence of Cross-lingual Alignment during LLM Training
by: Wang, Hetong, et al.
Published: (2024)
by: Wang, Hetong, et al.
Published: (2024)
MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression
by: Sun, Libo, et al.
Published: (2026)
by: Sun, Libo, et al.
Published: (2026)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy
by: Zhang, Rongzhi, et al.
Published: (2024)
by: Zhang, Rongzhi, et al.
Published: (2024)
Sparse Attention across Multiple-context KV Cache
by: Cao, Ziyi, et al.
Published: (2025)
by: Cao, Ziyi, et al.
Published: (2025)
Similar Items
-
KV Cache Transform Coding for Compact Storage in LLM Inference
by: Staniszewski, Konrad, et al.
Published: (2025) -
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
by: Nawrot, Piotr, et al.
Published: (2024) -
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
by: Nawrot, Piotr, et al.
Published: (2025) -
EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025) -
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
by: Behnam, Payman, et al.
Published: (2025)