Library Catalog

Bibliographic Details
Main Authors:	Su, Zunhai; Shen, Wang; Li, Linge; Chen, Zhe; Wei, Hanyu; Yu, Huangqi; Yuan, Kehong
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2501.15021

Similar Items

RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations
by: Su, Zunhai, et al.
Published: (2025)

KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
by: Su, Zunhai, et al.
Published: (2025)

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
by: Tu, Dezhan, et al.
Published: (2024)

AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
by: Li, Zeyu, et al.
Published: (2025)

QAQ: Quality Adaptive Quantization for LLM KV Cache
by: Dong, Shichen, et al.
Published: (2024)

Unveiling Super Experts in Mixture-of-Experts Large Language Models
by: Su, Zunhai, et al.
Published: (2025)

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
by: Zandieh, Amir, et al.
Published: (2024)

XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
by: Yang, Haoqi, et al.
Published: (2025)

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
by: Chen, Han, et al.
Published: (2025)

DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs
by: Zhou, Xiabin, et al.
Published: (2024)

VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
by: Yao, Dingyu, et al.
Published: (2025)

$A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving
by: Zhou, Yuechi, et al.
Published: (2025)

Accurate KV Cache Quantization with Outlier Tokens Tracing
by: Su, Yi, et al.
Published: (2025)

InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models
by: Hosseini, Sayed Mohammadreza Tayaranian, et al.
Published: (2026)

AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
by: Gu, Yifeng, et al.
Published: (2025)

dKV-Cache: The Cache for Diffusion Language Models
by: Ma, Xinyin, et al.
Published: (2025)

AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
by: Yang, Qingyue, et al.
Published: (2025)

G-KV: Decoding-Time KV Cache Eviction with Global Attention
by: Liao, Mengqi, et al.
Published: (2025)

WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models
by: Yuan, Jian, et al.
Published: (2025)

SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
by: Chen, Jinhan, et al.
Published: (2025)

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
by: Ma, Da, et al.
Published: (2024)

Attention Is All You Need for KV Cache in Diffusion LLMs
by: Nguyen-Tri, Quan, et al.
Published: (2025)

SQuat: Subspace-orthogonal KV Cache Quantization
by: Wang, Hao, et al.
Published: (2025)

KV Shifting Attention Enhances Language Modeling
by: Xu, Mingyu, et al.
Published: (2024)

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
by: Liu, Zirui, et al.
Published: (2024)

OjaKV: Context-Aware Online Low-Rank KV Cache Compression
by: Zhu, Yuxuan, et al.
Published: (2025)

MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
by: Sharma, Akshat, et al.
Published: (2024)

SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining
by: Zhang, Yifan, et al.
Published: (2026)

WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
by: Zuo, Youhui, et al.
Published: (2025)

H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference
by: Vejendla, Harshil
Published: (2025)

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
by: Du, Dayou, et al.
Published: (2025)

Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads
by: He, Xingyang, et al.
Published: (2025)

Quantization Dominates Rank Reduction for KV-Cache Compression
by: Salfati, Samuel
Published: (2026)

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
by: Ye, Lu, et al.
Published: (2024)

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
by: Feng, Yuan, et al.
Published: (2024)

CommVQ: Commutative Vector Quantization for KV Cache Compression
by: Li, Junyan, et al.
Published: (2025)

AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
by: Tao, Qian, et al.
Published: (2024)

Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models
by: Guo, Linge
Published: (2024)

LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models
by: Shi, Dachuan, et al.
Published: (2025)

PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models
by: Zhu, He, et al.
Published: (2025)