| Main Authors: | Xia, Haojun, Wu, Xiaoxia, Li, Jisen, Wu, Robert, Wang, Junxiong, Wang, Jue, Li, Chenxi, Singhal, Aman, Shah, Alay Dilipbhai, Ariyak, Alpay, Zhuang, Donglin, Zhou, Zhongzhu, Athiwaratkun, Ben, Zheng, Zhen, Song, Shuaiwen Leon |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2511.18643 |
Similar Items
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
by: Zhou, Zhongzhu, et al.
Published: (2026)
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
by: Jia, Jinda, et al.
Published: (2026)
CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
by: Zhou, Zhongzhu, et al.
Published: (2026)
Imitate Optimal Policy: Prevail and Induce Action Collapse in Policy Gradient
by: Zhou, Zhongzhu, et al.
Published: (2025)
Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time
by: Zhang, Zhenyu, et al.
Published: (2025)
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
by: Shao, Zelei, et al.
Published: (2025)
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System
by: Wang, Junxiong, et al.
Published: (2026)
Introspective Diffusion Language Models
by: Yu, Yifan, et al.
Published: (2026)
Accurate KV Cache Quantization with Outlier Tokens Tracing
by: Su, Yi, et al.
Published: (2025)
$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
by: Singh, Harman, et al.
Published: (2026)
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
by: He, Yefei, et al.
Published: (2024)
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
by: Xia, Haojun, et al.
Published: (2024)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
by: Liu, Zirui, et al.
Published: (2024)
Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping
by: Zhang, Muru, et al.
Published: (2025)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression
by: Liu, Peiyu, et al.
Published: (2024)
RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations
by: Su, Zunhai, et al.
Published: (2025)
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management
by: Xiong, Yi, et al.
Published: (2024)
Data Diversification Methods In Alignment Enhance Math Performance In LLMs
by: Dokmeci, Berkan, et al.
Published: (2025)
KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache
by: Li, Fei, et al.
Published: (2025)
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
by: An, Yongqi, et al.
Published: (2026)
TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks
by: Shen, Hanzhang, et al.
Published: (2026)
CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning
by: Yang, Yibo, et al.
Published: (2024)
MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration
by: Wang, Jinguang, et al.
Published: (2025)
Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
by: Maheswaran, Monishwaran, et al.
Published: (2026)
SQuat: Subspace-orthogonal KV Cache Quantization
by: Wang, Hao, et al.
Published: (2025)
Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods
by: Wang, Junlin, et al.
Published: (2025)
CommVQ: Commutative Vector Quantization for KV Cache Compression
by: Li, Junyan, et al.
Published: (2025)
Kitti_to_hdmapping_seq00
by: Chudziński
Published: (2026)
Improving Model Alignment Through Collective Intelligence of Open-Source LLMS
by: Wang, Junlin, et al.
Published: (2025)
AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization
by: Tan, Yifan, et al.
Published: (2024)
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
by: Tao, Qian, et al.
Published: (2024)
LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
by: Shen, Yiqun, et al.
Published: (2025)
Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
by: Swain, Kabir, et al.
Published: (2026)
QAQ: Quality Adaptive Quantization for LLM KV Cache
by: Dong, Shichen, et al.
Published: (2024)
PolarQuant: Quantizing KV Caches with Polar Transformation
by: Han, Insu, et al.
Published: (2025)
Quantization Dominates Rank Reduction for KV-Cache Compression
by: Salfati, Samuel
Published: (2026)
ContiguousKV: Accelerating LLM Prefill with Granularity-Aligned KV Cache Management
by: Zou, Jing, et al.
Published: (2026)
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
by: Li, Zhikai, et al.
Published: (2026)
RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
by: Zhang, Junkai, et al.
Published: (2026)
LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation
by: Ahn, Jinwoo, et al.
Published: (2026)