Staff View: Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

Bibliographic Details
Main Authors:	Xia, Haojun; Wu, Xiaoxia; Li, Jisen; Wu, Robert; Wang, Junxiong; Wang, Jue; Li, Chenxi; Singhal, Aman; Shah, Alay Dilipbhai; Ariyak, Alpay; Zhuang, Donglin; Zhou, Zhongzhu; Athiwaratkun, Ben; Zheng, Zhen; Song, Shuaiwen Leon
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.18643

_version_	1866909920225394688
author	Xia, Haojun; Wu, Xiaoxia; Li, Jisen; Wu, Robert; Wang, Junxiong; Wang, Jue; Li, Chenxi; Singhal, Aman; Shah, Alay Dilipbhai; Ariyak, Alpay; Zhuang, Donglin; Zhou, Zhongzhu; Athiwaratkun, Ben; Zheng, Zhen; Song, Shuaiwen Leon
author_facet	Xia, Haojun; Wu, Xiaoxia; Li, Jisen; Wu, Robert; Wang, Junxiong; Wang, Jue; Li, Chenxi; Singhal, Aman; Shah, Alay Dilipbhai; Ariyak, Alpay; Zhuang, Donglin; Zhou, Zhongzhu; Athiwaratkun, Ben; Zheng, Zhen; Song, Shuaiwen Leon
contents	The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit often degrades it, especially on long-context reasoning. We close this gap via an algorithm-system co-design for mixed-precision KV caching: Kitty. On the algorithm side, extensive experiments show that Dynamic Channel-wise Precision Boost -- which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision -- keeps accuracy loss near zero while approaching 2-bit memory. The main challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. Kitty addresses this by decomposing each mixed-precision Key page into two tensors with unified 2-bit precision. Based on this, Kitty provides a page-centric KV layout, Triton-compatible page dequantization kernels, and a lightweight runtime pipeline that preserves coalescing and avoids divergence. Across seven tasks and two model families (Qwen3, LLaMA3), Kitty cuts KV memory by nearly 8x with negligible accuracy loss, enabling up to 8x larger batches and 2.1x-4.1x higher throughput under the same memory budget. We release the full implementation of Kitty at https://github.com/Summer-Summer/Kitty.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_18643
institution	arXiv
publishDate	2025
record_format	arxiv
title	Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2511.18643
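
The two ideas summarized in the contents field (ranking Key-cache channels by sensitivity, and storing a mixed-precision Key page as two uniformly 2-bit tensors) can be illustrated with a small NumPy sketch. This is not the released Kitty implementation. The function names, the per-channel value range used as a sensitivity proxy, the min/max asymmetric quantizer, and the unpacked `uint8` storage are all assumptions made for illustration only.

```python
import numpy as np

def rank_boosted_channels(keys, boost_fraction):
    """Pick the channels to keep at 4-bit precision.

    Assumption: sensitivity is approximated by each channel's value range;
    the abstract does not state the actual ranking metric.
    """
    sensitivity = keys.max(axis=0) - keys.min(axis=0)
    n_boost = max(1, int(round(boost_fraction * keys.shape[1])))
    return np.sort(np.argsort(sensitivity)[-n_boost:])

def quantize_page(keys, boosted):
    """Quantize a (tokens, channels) Key page into two 2-bit code tensors.

    `base` holds a 2-bit code for every channel; `extra` holds a second
    2-bit code for the boosted channels only (hypothetical layout).
    """
    offset = keys.min(axis=0)
    span = np.maximum(keys.max(axis=0) - offset, 1e-8)
    is_boosted = np.zeros(keys.shape[1], dtype=bool)
    is_boosted[boosted] = True
    levels = np.where(is_boosted, 15, 3)          # 4-bit vs 2-bit code range
    step = span / levels
    codes = np.clip(np.rint((keys - offset) / step), 0, levels).astype(np.uint8)
    # Boosted channels: split the 4-bit code into high and low 2-bit halves.
    base = np.where(is_boosted, codes >> 2, codes).astype(np.uint8)
    extra = (codes[:, boosted] & 3).astype(np.uint8)
    base_step = np.where(is_boosted, 4 * step, step)
    return {"base": base, "extra": extra, "base_step": base_step,
            "extra_step": step[boosted], "offset": offset, "boosted": boosted}

def dequantize_page(page):
    """Dequantize both 2-bit tensors with the same scale-and-add arithmetic."""
    out = page["base"] * page["base_step"] + page["offset"]
    out[:, page["boosted"]] += page["extra"] * page["extra_step"]
    return out

# Usage on synthetic data: 64 tokens, 16 channels, 2 channels boosted.
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 16))
keys[:, [1, 5]] *= 10.0                           # two wide-range channels
boosted = rank_boosted_channels(keys, boost_fraction=0.125)
page = quantize_page(keys, boosted)
restored = dequantize_page(page)
```

In this sketch every stored code fits in 2 bits, so with a boosted fraction f the average cost is roughly 2 + 2f bits per Key element, ignoring scales and offsets. A real kernel would also pack four codes per byte, which is omitted here for readability.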

Similar Items