Staff View: GPTVQ: The Blessing of Dimensionality for LLM Quantization :: Library Catalog

Bibliographic Details
Main Authors:	van Baalen, Mart; Kuzmin, Andrey; Koryakovskiy, Ivan; Nagel, Markus; Couperus, Peter; Bastoul, Cedric; Mahurin, Eric; Blankevoort, Tijmen; Whatmough, Paul
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2402.15319

_version_	1866912409763971072
author	van Baalen, Mart; Kuzmin, Andrey; Koryakovskiy, Ivan; Nagel, Markus; Couperus, Peter; Bastoul, Cedric; Mahurin, Eric; Blankevoort, Tijmen; Whatmough, Paul
author_facet	van Baalen, Mart; Kuzmin, Andrey; Koryakovskiy, Ivan; Nagel, Markus; Couperus, Peter; Bastoul, Cedric; Mahurin, Eric; Blankevoort, Tijmen; Whatmough, Paul
contents	In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose GPTVQ, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated and further compressed using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on the quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU, we show that VQ leads to improved latency compared to using a 4-bit integer format.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_15319
institution	arXiv
publishDate	2024
record_format	arxiv
title	GPTVQ: The Blessing of Dimensionality for LLM Quantization
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2402.15319