:: Library Catalog

Bibliographic Details
Main Authors:	Li, Zeyu; Xiao, Chuanfu; Wang, Yang; Liu, Xiang; Tang, Zhenheng; Lu, Baotong; Yang, Mao; Chen, Xinyu; Chu, Xiaowen
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2506.19505

Similar Items

ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
by: Liu, Xiang, et al.
Published: (2025)

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
by: Liu, Xiang, et al.
Published: (2025)

Accurate KV Cache Quantization with Outlier Tokens Tracing
by: Su, Yi, et al.
Published: (2025)

VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
by: Yao, Dingyu, et al.
Published: (2025)

Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
by: Tao, Keda, et al.
Published: (2025)

KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning
by: Yang, Zebin, et al.
Published: (2026)

AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
by: Tao, Qian, et al.
Published: (2024)

NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache
by: Son, Donghyun, et al.
Published: (2025)

CommVQ: Commutative Vector Quantization for KV Cache Compression
by: Li, Junyan, et al.
Published: (2025)

XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
by: Yang, Haoqi, et al.
Published: (2025)

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
by: Du, Dayou, et al.
Published: (2025)

No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
by: Yang, June Yong, et al.
Published: (2024)

RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations
by: Su, Zunhai, et al.
Published: (2025)

CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs
by: Han, Insu, et al.
Published: (2025)

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
by: Zandieh, Amir, et al.
Published: (2024)

On the Spectral Flattening of Quantized Embeddings
by: Huang, Junlin, et al.
Published: (2026)

AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models
by: Su, Zunhai, et al.
Published: (2025)

ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
by: He, Yefei, et al.
Published: (2024)

FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management
by: Liu, Xiang, et al.
Published: (2025)

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
by: Boroujeni, Sayed Pedram Haeri, et al.
Published: (2026)

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
by: Zhang, Junkai, et al.
Published: (2026)

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
by: Chen, Han, et al.
Published: (2025)

Should We Really Edit Language Models? On the Evaluation of Edited Language Models
by: Li, Qi, et al.
Published: (2024)

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
by: Zhang, Tianyi, et al.
Published: (2024)

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
by: Jia, Jinda, et al.
Published: (2026)

MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing
by: Su, Zhaoyuan, et al.
Published: (2025)

Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
by: Liu, Xiang, et al.
Published: (2026)

ExpertFlow: Efficient Mixture-of-Experts Inference via Predictive Expert Caching and Token Scheduling
by: He, Xin, et al.
Published: (2024)

HOSCF: Efficient decoupling algorithms for finding the best rank-one approximation of higher-order tensors
by: Xiao, Chuanfu, et al.
Published: (2024)

OmniReview: A Large-scale Benchmark and LLM-enhanced Framework for Realistic Reviewer Recommendation
by: Huang, Yehua, et al.
Published: (2026)

BitDance: Scaling Autoregressive Generative Models with Binary Tokens
by: Ai, Yuang, et al.
Published: (2026)

FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
by: Lee, Namyoon, et al.
Published: (2026)

Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
by: Xi, Haocheng, et al.
Published: (2026)

CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing
by: Lu, Kuan, et al.
Published: (2025)

Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models
by: Dong, Peijie, et al.
Published: (2024)

A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration
by: Dai, Lipeng, et al.
Published: (2026)

Titanus: Enabling KV Cache Pruning and Quantization On-the-Fly for LLM Acceleration
by: Chen, Peilin, et al.
Published: (2025)

Subkv: Quantizing Long Context KV Cache for Sub‐Billion Parameter Language Models on Edge Devices
by: Zeng, Ziqian, et al.
Published: (2025)

VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting
by: Tang, Yujin, et al.
Published: (2024)

Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression
by: Dong, Peijie, et al.
Published: (2025)