| Main Authors: | Zhang, Cheng; Cheng, Jianyi; Constantinides, George A.; Zhao, Yiren |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Machine Learning; Computation and Language |
| Online Access: | https://arxiv.org/abs/2402.02446 |
| _version_ | 1866910463660392448 |
|---|---|
| author | Zhang, Cheng; Cheng, Jianyi; Constantinides, George A.; Zhao, Yiren |
| author_facet | Zhang, Cheng; Cheng, Jianyi; Constantinides, George A.; Zhao, Yiren |
| contents | Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reduction (LQER), which combines quantization and low-rank approximation to recover the model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of quantization error towards a desirable distribution, which enables nearly lossless W4A8 quantization on various LLMs and downstream tasks without the need for knowledge distillation, grid search, or gradient-based iterative optimization. Unlike existing methods, the computation pattern of LQER eliminates the need for specialized Scatter and Gather processes to collect high-precision weights from irregular memory locations. Our W4A8 LLMs achieve near-lossless performance on six popular downstream tasks, while using 1.36$\times$ fewer hardware resources than the leading state-of-the-art method. We open-source our framework at https://github.com/ChengZhang-98/lqer |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2402_02446 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | LQER: Low-Rank Quantization Error Reconstruction for LLMs |
| topic | Machine Learning; Computation and Language |
| url | https://arxiv.org/abs/2402.02446 |
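
The abstract describes combining quantization with a low-rank approximation of the quantization error, shaped by an activation-induced scale matrix. The NumPy sketch below illustrates one plausible reading of that idea and is not the authors' implementation: the per-tensor symmetric quantizer, the function names `quantize_symmetric` and `low_rank_error`, the `(in_features, out_features)` weight layout, and the mean-absolute-activation statistic used as the scale are all assumptions made for illustration.

```python
import numpy as np

def quantize_symmetric(w, bits=4):
    """Per-tensor symmetric uniform quantization (an assumed, simple quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def low_rank_error(w, w_q, rank, act_scale=None):
    """Return factors (a, b) such that a @ b approximates the error w - w_q.

    act_scale is an optional positive vector with one entry per input channel.
    The SVD is taken of diag(act_scale) @ error, and the scaling is undone on
    the left factor, so heavily used input channels are approximated better.
    """
    err = w - w_q
    if act_scale is None:
        act_scale = np.ones(w.shape[0])
    u, sv, vt = np.linalg.svd(act_scale[:, None] * err, full_matrices=False)
    a = (u[:, :rank] * sv[:rank]) / act_scale[:, None]  # (in_features, rank)
    b = vt[:rank]                                       # (rank, out_features)
    return a, b

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))   # weights, laid out as (in, out)
x = rng.standard_normal((8, 64))    # a small batch of activations

w_q = quantize_symmetric(w, bits=4)
s = np.abs(x).mean(axis=0) + 1e-6   # hypothetical activation statistic
a, b = low_rank_error(w, w_q, rank=8, act_scale=s)

y_ref = x @ w                       # full-precision output
y_q = x @ w_q                       # quantized weights only
y_lqer = x @ w_q + (x @ a) @ b      # quantized weights plus low-rank correction

print("output error, quantized only:      ", np.linalg.norm(y_ref - y_q))
print("output error, with low-rank factors:", np.linalg.norm(y_ref - y_lqer))
```

In this sketch the correction is two dense matrix products on regular, contiguous factors, which is in the spirit of the abstract's remark that no Scatter or Gather of irregularly placed high-precision weights is needed.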