Bibliographic Details
Main Authors: van Baalen, Mart, Kuzmin, Andrey, Koryakovskiy, Ivan, Nagel, Markus, Couperus, Peter, Bastoul, Cedric, Mahurin, Eric, Blankevoort, Tijmen, Whatmough, Paul
Format: Preprint
Published: 2024
Subjects: Machine Learning, Computation and Language
Online Access: https://arxiv.org/abs/2402.15319
author van Baalen, Mart
Kuzmin, Andrey
Koryakovskiy, Ivan
Nagel, Markus
Couperus, Peter
Bastoul, Cedric
Mahurin, Eric
Blankevoort, Tijmen
Whatmough, Paul
contents In this work, we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose GPTVQ, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated and further compressed using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on the quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU, we show that VQ leads to improved latency compared to using a 4-bit integer format.
format Preprint
id arxiv_https___arxiv_org_abs_2402_15319
institution arXiv
publishDate 2024
record_format arxiv
title GPTVQ: The Blessing of Dimensionality for LLM Quantization
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2402.15319
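
The abstract describes vector quantization (VQ) of weights against a codebook. As a rough illustration of that basic idea only, the sketch below groups a flat list of weights into 2-dimensional vectors and maps each vector to its nearest codebook entry. It is not the GPTVQ method: the codebook is fitted with plain Lloyd's k-means rather than the data-aware EM initialization, and the Hessian-based weight updates and codebook compression are omitted. All function names (`kmeans_codebook`, `nearest`, `vq_quantize`, `index_bits_per_weight`) and the synthetic Gaussian weights are assumptions made for this example.

```python
import math
import random


def nearest(v, centroids):
    """Index of the centroid closest to v (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(v, centroids[j])))


def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Fit a codebook with plain Lloyd's k-means (an assumption for this
    sketch; the abstract describes a data-aware EM variant instead)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        # Assignment step: map every vector to its nearest centroid.
        assign = [nearest(v, centroids) for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def vq_quantize(weights, dim, k):
    """Split weights into dim-sized vectors and quantize each to a codebook entry."""
    assert len(weights) % dim == 0, "weight count must be divisible by dim"
    vectors = [weights[i:i + dim] for i in range(0, len(weights), dim)]
    codebook = kmeans_codebook(vectors, k)
    indices = [nearest(v, codebook) for v in vectors]
    # Dequantization is a table lookup: replace each index by its centroid.
    dequant = [x for i in indices for x in codebook[i]]
    return codebook, indices, dequant


def index_bits_per_weight(dim, k):
    """Index storage cost per weight, excluding the codebook itself."""
    return math.log2(k) / dim


# Synthetic stand-in for a weight tensor (hypothetical data, not from the paper).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(512)]
codebook, indices, dequant = vq_quantize(weights, dim=2, k=16)
mse = sum((w - q) ** 2 for w, q in zip(weights, dequant)) / len(weights)
print("index bits per weight:", index_bits_per_weight(2, 16))
print("reconstruction MSE:", mse)
```

With 16 centroids over 2-dimensional vectors, each index costs log2(16) = 4 bits and covers 2 weights, so the index cost is 2 bits per weight before counting codebook storage. Raising `dim` while holding the bits per weight fixed requires a codebook that grows exponentially, which is presumably why the abstract mentions further codebook compression.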