:: Library Catalog

Bibliographic Details
Main Authors:	Mu, Junlin; Huang, Hantao; Zhang, Jihang; Yu, Minghui; Wang, Tao; Li, Yidong
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2510.24273

Similar Items

QET: Enhancing Quantized LLM Parameters and KV cache Compression through Element Substitution and Residual Clustering
by: Wang, Yanshu, et al.
Published: (2024)

An experimental study of KV cache reuse strategies in chunk-level caching systems
by: Cestola, Samuel, et al.
Published: (2026)

IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
by: Mao, Yuzhen, et al.
Published: (2026)

Residual vector quantization for KV cache compression in large language model
by: Kumar, Ankur
Published: (2024)

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
by: Saxena, Utkarsh, et al.
Published: (2024)

AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
by: Yang, Qingyue, et al.
Published: (2025)

ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
by: Liu, Guangda, et al.
Published: (2024)

BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference
by: Gulhan, Ahmed Burak, et al.
Published: (2025)

Sparse Attention across Multiple-context KV Cache
by: Cao, Ziyi, et al.
Published: (2025)

KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
by: Ye, Hancheng, et al.
Published: (2025)

KaVa: Latent Reasoning via Compressed KV-Cache Distillation
by: Kuzina, Anna, et al.
Published: (2025)

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
by: Ye, Lu, et al.
Published: (2024)

SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
by: S, Santhosh G, et al.
Published: (2025)

KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity
by: Lesens, Damien, et al.
Published: (2025)

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
by: Tang, Hanlin, et al.
Published: (2024)

Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
by: Liu, Andy Zeyi, et al.
Published: (2026)

CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
by: Wang, Yixuan, et al.
Published: (2025)

Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries
by: Kim, Junhyuck, et al.
Published: (2024)

HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
by: Yang, Dongquan, et al.
Published: (2025)

CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation
by: Yang, Ning, et al.
Published: (2026)

EvolKV: Evolutionary KV Cache Compression for LLM Inference
by: Yu, Bohan, et al.
Published: (2025)

Palu: Compressing KV-Cache with Low-Rank Projection
by: Chang, Chi-Chih, et al.
Published: (2024)

Enhancing Large Multimodal Models with Adaptive Sparsity and KV Cache Compression
by: Zhang, Te, et al.
Published: (2025)

KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models
by: Roy, Sourjya, et al.
Published: (2025)

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
by: Dehghankar, Mohsen, et al.
Published: (2026)

Training Transformers for KV Cache Compressibility
by: Gelberg, Yoav, et al.
Published: (2026)

ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
by: Yan, Xianglong, et al.
Published: (2025)

The Pitfalls of KV Cache Compression
by: Chen, Alex, et al.
Published: (2025)

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
by: Liu, Sihao, et al.
Published: (2026)

ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
by: Ramachandran, Akshat, et al.
Published: (2025)

SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
by: Zhao, Yi, et al.
Published: (2025)

GrassNet: State Space Model Meets Graph Neural Network
by: Zhao, Gongpei, et al.
Published: (2024)

xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
by: Chang, Chi-Chih, et al.
Published: (2025)

MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
by: Lin, Bokai, et al.
Published: (2024)

OjaKV: Context-Aware Online Low-Rank KV Cache Compression
by: Zhu, Yuxuan, et al.
Published: (2025)

Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
by: Mao, Yuzhen, et al.
Published: (2026)

Adaptive Compression of the Latent Space in Variational Autoencoders
by: Sejnova, Gabriela, et al.
Published: (2023)

LongFlow: Efficient KV Cache Compression for Reasoning Models
by: Su, Yi, et al.
Published: (2026)

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
by: Guan, Ziyi, et al.
Published: (2024)

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
by: Behnam, Payman, et al.
Published: (2025)