| Main Authors: | Su, Zunhai, Shen, Wang, Li, Linge, Chen, Zhe, Wei, Hanyu, Yu, Huangqi, Yuan, Kehong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2501.15021 |
Similar Items
RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations
by: Su, Zunhai, et al.
Published: (2025)
KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
by: Su, Zunhai, et al.
Published: (2025)
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
by: Tu, Dezhan, et al.
Published: (2024)
AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
by: Li, Zeyu, et al.
Published: (2025)
QAQ: Quality Adaptive Quantization for LLM KV Cache
by: Dong, Shichen, et al.
Published: (2024)
Unveiling Super Experts in Mixture-of-Experts Large Language Models
by: Su, Zunhai, et al.
Published: (2025)
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
by: Zandieh, Amir, et al.
Published: (2024)
XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
by: Yang, Haoqi, et al.
Published: (2025)
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
by: Chen, Han, et al.
Published: (2025)
DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs
by: Zhou, Xiabin, et al.
Published: (2024)
VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
by: Yao, Dingyu, et al.
Published: (2025)
$A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving
by: Zhou, Yuechi, et al.
Published: (2025)
Accurate KV Cache Quantization with Outlier Tokens Tracing
by: Su, Yi, et al.
Published: (2025)
InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models
by: Hosseini, Sayed Mohammadreza Tayaranian, et al.
Published: (2026)
AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
by: Gu, Yifeng, et al.
Published: (2025)
dKV-Cache: The Cache for Diffusion Language Models
by: Ma, Xinyin, et al.
Published: (2025)
AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
by: Yang, Qingyue, et al.
Published: (2025)
G-KV: Decoding-Time KV Cache Eviction with Global Attention
by: Liao, Mengqi, et al.
Published: (2025)
WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models
by: Yuan, Jian, et al.
Published: (2025)
SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
by: Chen, Jinhan, et al.
Published: (2025)
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
by: Ma, Da, et al.
Published: (2024)
Attention Is All You Need for KV Cache in Diffusion LLMs
by: Nguyen-Tri, Quan, et al.
Published: (2025)
SQuat: Subspace-orthogonal KV Cache Quantization
by: Wang, Hao, et al.
Published: (2025)
KV Shifting Attention Enhances Language Modeling
by: Xu, Mingyu, et al.
Published: (2024)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
by: Liu, Zirui, et al.
Published: (2024)
OjaKV: Context-Aware Online Low-Rank KV Cache Compression
by: Zhu, Yuxuan, et al.
Published: (2025)
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
by: Sharma, Akshat, et al.
Published: (2024)
SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining
by: Zhang, Yifan, et al.
Published: (2026)
WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
by: Zuo, Youhui, et al.
Published: (2025)
H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference
by: Vejendla, Harshil
Published: (2025)
BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
by: Du, Dayou, et al.
Published: (2025)
Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads
by: He, Xingyang, et al.
Published: (2025)
Quantization Dominates Rank Reduction for KV-Cache Compression
by: Salfati, Samuel
Published: (2026)
ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
by: Ye, Lu, et al.
Published: (2024)
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
by: Feng, Yuan, et al.
Published: (2024)
CommVQ: Commutative Vector Quantization for KV Cache Compression
by: Li, Junyan, et al.
Published: (2025)
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
by: Tao, Qian, et al.
Published: (2024)
Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models
by: Guo, Linge
Published: (2024)
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models
by: Shi, Dachuan, et al.
Published: (2025)
PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models
by: Zhu, He, et al.
Published: (2025)