
Bibliographic Details
Main Authors:	Hu, Yuezhou; Huang, Weiyu; Liang, Zichen; Chen, Chang; Zhang, Jintao; Zhu, Jun; Chen, Jianfei
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2503.01901

_version_	1866912257568407552
author	Hu, Yuezhou; Huang, Weiyu; Liang, Zichen; Chen, Chang; Zhang, Jintao; Zhu, Jun; Chen, Jianfei
author_facet	Hu, Yuezhou; Huang, Weiyu; Liang, Zichen; Chen, Chang; Zhang, Jintao; Zhu, Jun; Chen, Jianfei
contents	Serving Large Language Models (LLMs) is costly. Post-training weight quantization can address this problem by compressing model size to fit limited memory and by saving bandwidth for acceleration. Because not all weight dimensions are equally important, these methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on the loss function and is used to preprocess the original weights for better quantization. In this work, we conduct an empirical study of the accuracy of the sensitivity metric and find that existing gradient- and Hessian-based metrics are very inaccurate: they underestimate quantization's impact on the loss function by orders of magnitude, mainly due to the small convergence radius of the local second-order approximation, i.e., the gradient and Hessian terms in Taylor's formula. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric for estimating posterior sensitivity in a fine-grained manner. To leverage this metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weights detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced perplexity improvement of 2.66 on Llama 3.2 1B with QTIP.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_01901
institution	arXiv
publishDate	2025
record_format	arxiv
title	Identifying Sensitive Weights via Post-quantization Integral
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2503.01901