Staff View: QTIP: Quantization with Trellises and Incoherence Processing :: Library Catalog

Bibliographic Details
Main Authors:	Tseng, Albert; Sun, Qingyao; Hou, David; De Sa, Christopher
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2406.11235

_version_	1866916797180018688
author	Tseng, Albert; Sun, Qingyao; Hou, David; De Sa, Christopher
author_facet	Tseng, Albert; Sun, Qingyao; Hou, David; De Sa, Christopher
contents	Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_11235
institution	arXiv
publishDate	2024
record_format	arxiv
title	QTIP: Quantization with Trellises and Incoherence Processing
topic	Machine Learning
url	https://arxiv.org/abs/2406.11235
