| Main Authors: | Li, Zeyu; Xiao, Chuanfu; Wang, Yang; Liu, Xiang; Tang, Zhenheng; Lu, Baotong; Yang, Mao; Chen, Xinyu; Chu, Xiaowen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.19505 |
Similar Items
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
by: Liu, Xiang, et al.
Published: (2025)
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
by: Liu, Xiang, et al.
Published: (2025)
Accurate KV Cache Quantization with Outlier Tokens Tracing
by: Su, Yi, et al.
Published: (2025)
VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
by: Yao, Dingyu, et al.
Published: (2025)
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
by: Tao, Keda, et al.
Published: (2025)
KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning
by: Yang, Zebin, et al.
Published: (2026)
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
by: Tao, Qian, et al.
Published: (2024)
NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache
by: Son, Donghyun, et al.
Published: (2025)
CommVQ: Commutative Vector Quantization for KV Cache Compression
by: Li, Junyan, et al.
Published: (2025)
XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
by: Yang, Haoqi, et al.
Published: (2025)
BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
by: Du, Dayou, et al.
Published: (2025)
No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
by: Yang, June Yong, et al.
Published: (2024)
RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations
by: Su, Zunhai, et al.
Published: (2025)
CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs
by: Han, Insu, et al.
Published: (2025)
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
by: Zandieh, Amir, et al.
Published: (2024)
On the Spectral Flattening of Quantized Embeddings
by: Huang, Junlin, et al.
Published: (2026)
AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models
by: Su, Zunhai, et al.
Published: (2025)
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
by: He, Yefei, et al.
Published: (2024)
FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management
by: Liu, Xiang, et al.
Published: (2025)
Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
by: Boroujeni, Sayed Pedram Haeri, et al.
Published: (2026)
RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
by: Zhang, Junkai, et al.
Published: (2026)
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
by: Chen, Han, et al.
Published: (2025)
Should We Really Edit Language Models? On the Evaluation of Edited Language Models
by: Li, Qi, et al.
Published: (2024)
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
by: Zhang, Tianyi, et al.
Published: (2024)
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
by: Jia, Jinda, et al.
Published: (2026)
MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing
by: Su, Zhaoyuan, et al.
Published: (2025)
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
by: Liu, Xiang, et al.
Published: (2026)
ExpertFlow: Efficient Mixture-of-Experts Inference via Predictive Expert Caching and Token Scheduling
by: He, Xin, et al.
Published: (2024)
HOSCF: Efficient decoupling algorithms for finding the best rank-one approximation of higher-order tensors
by: Xiao, Chuanfu, et al.
Published: (2024)
OmniReview: A Large-scale Benchmark and LLM-enhanced Framework for Realistic Reviewer Recommendation
by: Huang, Yehua, et al.
Published: (2026)
BitDance: Scaling Autoregressive Generative Models with Binary Tokens
by: Ai, Yuang, et al.
Published: (2026)
FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
by: Lee, Namyoon, et al.
Published: (2026)
Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
by: Xi, Haocheng, et al.
Published: (2026)
CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing
by: Lu, Kuan, et al.
Published: (2025)
Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models
by: Dong, Peijie, et al.
Published: (2024)
A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration
by: Dai, Lipeng, et al.
Published: (2026)
Titanus: Enabling KV Cache Pruning and Quantization On-the-Fly for LLM Acceleration
by: Chen, Peilin, et al.
Published: (2025)
Subkv: Quantizing Long Context KV Cache for Sub‐Billion Parameter Language Models on Edge Devices
by: Ziqian Zeng, et al.
Published: (2025)
VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting
by: Tang, Yujin, et al.
Published: (2024)
Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression
by: Dong, Peijie, et al.
Published: (2025)