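Staff View: decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points :: Library Catalog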

Bibliographic Details
Main Authors:	Guo, Yi; Kong, Fanliu; Li, Xiaoyang; Li, Hui; Chen, Wei; Tian, Xiaogang; Cai, Jinping; Zhang, Yang; Liu, Shouda
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2404.12759

_version_	1866909175771037696
author	Guo, Yi; Kong, Fanliu; Li, Xiaoyang; Li, Hui; Chen, Wei; Tian, Xiaogang; Cai, Jinping; Zhang, Yang; Liu, Shouda
author_facet	Guo, Yi; Kong, Fanliu; Li, Xiaoyang; Li, Hui; Chen, Wei; Tian, Xiaogang; Cai, Jinping; Zhang, Yang; Liu, Shouda
contents	Quantization has emerged in recent years as one of the most promising compression technologies for deploying large models efficiently in real-time applications. Because the storage and I/O of weights account for the vast majority of the overhead inside a large model, weight-only quantization can yield large gains. However, existing quantization schemes either suffer significant accuracy degradation at very low bit widths or require additional computational overhead at deployment, which makes them difficult to apply to large-scale industrial applications. In this paper, we propose decoupleQ, which achieves a substantial increase in model accuracy, especially at very low bit widths. decoupleQ abandons the traditional heuristic quantization paradigm and decouples the model parameters into integer and floating-point parts, thereby transforming quantization into a conventional constrained mathematical optimization problem, which is then solved alternately by off-the-shelf optimization methods. Quantization via decoupleQ is linear and uniform, which makes it more hardware-friendly than non-uniform counterparts and allows the idea to be migrated to high-bit quantization to enhance its robustness. Our method has achieved online accuracy close to fp16/bf16 with 2-bit quantization of large speech models at ByteDance. The code is available at https://github.com/bytedance/decoupleQ
format	Preprint
id	arxiv_https___arxiv_org_abs_2404_12759
institution	arXiv
publishDate	2024
record_format	arxiv
title	decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points
topic	Machine Learning
url	https://arxiv.org/abs/2404.12759
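
Illustrative note: the abstract describes splitting each weight into an integer part and floating-point parts and solving for them alternately. The sketch below shows that general idea in its simplest form and is not the authors' implementation. It assumes a plain weight reconstruction objective, w ≈ scale * q + zero, with q an integer in [0, 2**bits - 1]; the objective and solver actually used by decoupleQ are not stated in this record. The names `quantize_alternating` and `reconstruction_error` are hypothetical.

```python
def reconstruction_error(weights, q, scale, zero):
    """Sum of squared differences between weights and scale * q + zero."""
    return sum((w - (scale * qi + zero)) ** 2 for w, qi in zip(weights, q))


def quantize_alternating(weights, bits=2, iters=20):
    """Fit w ~ scale * q + zero by alternating minimization (assumed objective).

    q holds the integer part, constrained to [0, 2**bits - 1];
    scale and zero are the floating-point part.
    """
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # Constant input: any positive scale works, all codes are zero.
        return [0] * len(weights), 1.0, lo

    # Start from ordinary min-max uniform quantization.
    scale = (hi - lo) / levels
    zero = lo
    n = len(weights)
    q = []

    for _ in range(iters):
        # Step 1: hold (scale, zero) fixed, pick the best integer per weight
        # by rounding to the nearest level and clipping to the valid range.
        q = [min(levels, max(0, round((w - zero) / scale))) for w in weights]

        # Step 2: hold q fixed, solve the 1-D least-squares fit for
        # (scale, zero) in closed form.
        mean_q = sum(q) / n
        mean_w = sum(weights) / n
        var_q = sum((qi - mean_q) ** 2 for qi in q)
        if var_q == 0:
            break  # all codes identical, slope is undefined
        cov = sum((qi - mean_q) * (wi - mean_w) for qi, wi in zip(q, weights))
        new_scale = cov / var_q
        if new_scale <= 0:
            break  # keep the last valid positive scale
        scale = new_scale
        zero = mean_w - scale * mean_q

    return q, scale, zero
```

Because each step minimizes the same squared error over one group of variables while the other is held fixed, the error never increases from one iteration to the next, so the result is at least as good as the min-max starting point under this assumed objective. A call such as `quantize_alternating([-0.9, -0.35, 0.1, 1.2], bits=2)` returns the integer codes together with the fitted scale and zero point.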

Similar Items