Staff View: Quantize What Counts: More for Keys, Less for Values :: Library Catalog

Bibliographic Details
Main Authors:	Hariri, Mohsen; Luo, Alan; Chen, Weicong; Zhong, Shaochen; Zhang, Tianyi; Wang, Qifan; Hu, Xia; Han, Xiaotian; Chaudhary, Vipin
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2502.15075

_version_	1866909027395436544
author	Hariri, Mohsen; Luo, Alan; Chen, Weicong; Zhong, Shaochen; Zhang, Tianyi; Wang, Qifan; Hu, Xia; Han, Xiaotian; Chaudhary, Vipin
author_facet	Hariri, Mohsen; Luo, Alan; Chen, Weicong; Zhong, Shaochen; Zhang, Tianyi; Wang, Qifan; Hu, Xia; Han, Xiaotian; Chaudhary, Vipin
contents	Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_15075
institution	arXiv
publishDate	2025
record_format	arxiv
title	Quantize What Counts: More for Keys, Less for Values
topic	Machine Learning
url	https://arxiv.org/abs/2502.15075

Similar Items