Bibliographic Details
Main Authors: Hariri, Mohsen, Luo, Alan, Chen, Weicong, Zhong, Shaochen, Zhang, Tianyi, Wang, Qifan, Hu, Xia, Han, Xiaotian, Chaudhary, Vipin
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2502.15075
author Hariri, Mohsen
Luo, Alan
Chen, Weicong
Zhong, Shaochen
Zhang, Tianyi
Wang, Qifan
Hu, Xia
Han, Xiaotian
Chaudhary, Vipin
contents Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.
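The abstract's two claims (keys carry larger norms than values, so a fixed bit budget is better spent on keys) can be illustrated with a small numerical sketch. This is not the authors' code or their experimental setup. The tensors are synthetic Gaussians, the 3x key scale is an assumption chosen only to mimic "larger key norms", and `quantize_minmax`, `norms`, and `total_sq_error` are hypothetical helper names using a simple per-tensor min-max quantizer.

```python
import numpy as np


def quantize_minmax(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform per-tensor min-max quantization to 2**bits levels, then dequantize."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    # Guard against a constant tensor, where the range would be zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale)
    return codes * scale + lo


def norms(x: np.ndarray) -> tuple[float, float]:
    """Return (spectral norm, Frobenius norm) of a 2-D array."""
    return float(np.linalg.norm(x, 2)), float(np.linalg.norm(x, "fro"))


def total_sq_error(keys, values, key_bits: int, value_bits: int) -> float:
    """Summed squared reconstruction error of keys and values after quantization."""
    key_err = np.sum((keys - quantize_minmax(keys, key_bits)) ** 2)
    value_err = np.sum((values - quantize_minmax(values, value_bits)) ** 2)
    return float(key_err + value_err)


rng = np.random.default_rng(0)
values = rng.standard_normal((256, 64))
# Assumption for illustration only: keys are 3x larger in scale than values.
keys = 3.0 * rng.standard_normal((256, 64))

key_spec, key_fro = norms(keys)
val_spec, val_fro = norms(values)

# Both allocations spend the same total budget: 6 bits per key/value element pair.
err_k4v2 = total_sq_error(keys, values, key_bits=4, value_bits=2)
err_k2v4 = total_sq_error(keys, values, key_bits=2, value_bits=4)

print(f"spectral norm  keys={key_spec:.1f}  values={val_spec:.1f}")
print(f"Frobenius norm keys={key_fro:.1f}  values={val_fro:.1f}")
print(f"squared error  K4V2={err_k4v2:.1f}  K2V4={err_k2v4:.1f}")
```

In this toy setting the key-favored split (4-bit keys, 2-bit values) gives the lower total reconstruction error at equal memory. The sketch measures only the error in the cached tensors themselves. It does not measure attention outputs or model accuracy, which is what the evaluation described in the abstract addresses.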
format Preprint
id arxiv_https___arxiv_org_abs_2502_15075
institution arXiv
publishDate 2025
record_format arxiv
title Quantize What Counts: More for Keys, Less for Values
topic Machine Learning
url https://arxiv.org/abs/2502.15075