| Main Authors: | Tseng, Albert; Sun, Qingyao; Hou, David; De Sa, Christopher |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2406.11235 |
| _version_ | 1866916797180018688 |
|---|---|
| author | Tseng, Albert; Sun, Qingyao; Hou, David; De Sa, Christopher |
| author_facet | Tseng, Albert; Sun, Qingyao; Hou, David; De Sa, Christopher |
| contents | Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2406_11235 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | QTIP: Quantization with Trellises and Incoherence Processing |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2406.11235 |
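
The abstract's contrast between vector quantization, whose codebook grows exponentially with dimension, and a stateful trellis decoder can be illustrated with a small sketch. The code below is not taken from the QTIP paper or its implementation. The function names, the all-zero initial state, and the sliding-window reading of a "bitshift" trellis are assumptions made for illustration only.

```python
def vq_codebook_entries(bits_per_weight: int, dim: int) -> int:
    """Entries in an unstructured VQ codebook: 2 ** (bits_per_weight * dim)."""
    return 2 ** (bits_per_weight * dim)


def bitshift_decode(bits, state_bits, bits_per_step, codebook):
    """Hypothetical sliding-window decoder (illustrative, not QTIP's kernel).

    The decoder state is the last `state_bits` bits seen. Each step shifts in
    `bits_per_step` new bits and emits codebook[state], so the codebook has
    2 ** state_bits entries no matter how long the decoded sequence is.
    The initial state of 0 is an assumption of this sketch.
    """
    mask = (1 << state_bits) - 1
    state = 0
    decoded = []
    for start in range(0, len(bits), bits_per_step):
        for bit in bits[start:start + bits_per_step]:
            state = ((state << 1) | bit) & mask  # shift in one bit, drop the oldest
        decoded.append(codebook[state])
    return decoded


# At 2 bits per weight, 8-dimensional VQ already needs 2**16 entries.
print(vq_codebook_entries(2, 8))

# With an identity codebook the decoder output is simply the state sequence.
identity_codebook = list(range(2 ** 3))
print(bitshift_decode([1, 0, 1, 1], state_bits=3, bits_per_step=1,
                      codebook=identity_codebook))
```

In this sketch the bitrate is set by `bits_per_step` and the codebook size by `state_bits`, which is one way to read the abstract's claim that the stateful decoder separates codebook size from bitrate and effective dimension.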