| Main Authors: | Mu, Junlin; Huang, Hantao; Zhang, Jihang; Yu, Minghui; Wang, Tao; Li, Yidong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2510.24273 |
Similar Items
QET: Enhancing Quantized LLM Parameters and KV cache Compression through Element Substitution and Residual Clustering
by: Wang, Yanshu, et al.
Published: (2024)
An experimental study of KV cache reuse strategies in chunk-level caching systems
by: Cestola, Samuel, et al.
Published: (2026)
IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
by: Mao, Yuzhen, et al.
Published: (2026)
Residual vector quantization for KV cache compression in large language model
by: Kumar, Ankur
Published: (2024)
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)
AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
by: Yang, Qingyue, et al.
Published: (2025)
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
by: Liu, Guangda, et al.
Published: (2024)
BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference
by: Gulhan, Ahmed Burak, et al.
Published: (2025)
Sparse Attention across Multiple-context KV Cache
by: Cao, Ziyi, et al.
Published: (2025)
KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
by: Ye, Hancheng, et al.
Published: (2025)
KaVa: Latent Reasoning via Compressed KV-Cache Distillation
by: Kuzina, Anna, et al.
Published: (2025)
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
by: Ye, Lu, et al.
Published: (2024)
SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
by: S, Santhosh G, et al.
Published: (2025)
KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity
by: Lesens, Damien, et al.
Published: (2025)
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
by: Tang, Hanlin, et al.
Published: (2024)
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
by: Liu, Andy Zeyi, et al.
Published: (2026)
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)
Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
by: Kim, Junhyuck, et al.
Published: (2024)
HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
by: Yang, Dongquan, et al.
Published: (2025)
CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation
by: Yang, Ning, et al.
Published: (2026)
EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)
Palu: Compressing KV-Cache with Low-Rank Projection
by: Chang, Chi-Chih, et al.
Published: (2024)
Enhancing Large Multimodal Models with Adaptive Sparsity and KV Cache Compression
by: Zhang, Te, et al.
Published: (2025)
KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models
by: Roy, Sourjya, et al.
Published: (2025)
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
by: Dehghankar, Mohsen, et al.
Published: (2026)
Training Transformers for KV Cache Compressibility
by: Gelberg, Yoav, et al.
Published: (2026)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
by: Yan, Xianglong, et al.
Published: (2025)
The Pitfalls of KV Cache Compression
by: Chen, Alex, et al.
Published: (2025)
RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
by: Liu, Sihao, et al.
Published: (2026)
ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
by: Ramachandran, Akshat, et al.
Published: (2025)
SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
by: Zhao, Yi, et al.
Published: (2025)
GrassNet: State Space Model Meets Graph Neural Network
by: Zhao, Gongpei, et al.
Published: (2024)
xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
by: Chang, Chi-Chih, et al.
Published: (2025)
MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
by: Lin, Bokai, et al.
Published: (2024)
OjaKV: Context-Aware Online Low-Rank KV Cache Compression
by: Zhu, Yuxuan, et al.
Published: (2025)
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
by: Mao, Yuzhen, et al.
Published: (2026)
Adaptive Compression of the Latent Space in Variational Autoencoders
by: Sejnova, Gabriela, et al.
Published: (2023)
LongFlow: Efficient KV Cache Compression for Reasoning Models
by: Su, Yi, et al.
Published: (2026)
APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
by: Guan, Ziyi, et al.
Published: (2024)
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
by: Behnam, Payman, et al.
Published: (2025)