

Bibliographic Details
Main Authors:	Wang, Chenyu; Yan, Zhanglu; Zhou, Zhi; Chen, Xu; Wong, Weng-Fai
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2510.19498

_version_	1866915569812373504
author	Wang, Chenyu; Yan, Zhanglu; Zhou, Zhi; Chen, Xu; Wong, Weng-Fai
author_facet	Wang, Chenyu; Yan, Zhanglu; Zhou, Zhi; Chen, Xu; Wong, Weng-Fai
contents	In the era of large language models (LLMs), weight-activation quantization helps fit models on edge devices by reducing memory and compute bit-widths. However, three challenges persist for energy-constrained hardware: (1) even after quantization, multiply-accumulate (MAC) operations remain unavoidable and continue to dominate energy consumption; (2) dequantization (or per-tensor/channel rescaling) introduces extra arithmetic and data movement, increasing latency and energy; (3) uniform parameter bit-widths clip salient values, while intra-channel mixed precision is generally impractical on current matrix hardware and memory. In contrast, brain-inspired Spiking Neural Networks (SNNs), owing to their binary spike-based information representation and the Integrate-and-Fire (IF) paradigm, naturally support mixed-precision storage and energy-efficient computation by replacing complex MACs with temporal accumulate (ACC) operations. Motivated by this property, we propose SpikeQuant, which selectively applies mixed-precision quantization to activations with salient values and re-encodes them into binary spike counts, thereby enabling dynamic mixed storage of different bit-widths. Furthermore, by embedding the quantization scale into the threshold of the IF mechanism, our approach performs energy-efficient linear transformations on weights and activations while avoiding explicit dequantization. Experimental results demonstrate that SpikeQuant consistently achieves near-FP16 perplexity under W4A4 quantization while reducing energy cost by up to 4.6 times compared to existing methods, highlighting its effectiveness for accurate and energy-efficient LLM deployment.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_19498
institution	arXiv
publishDate	2025
record_format	arxiv
title	Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation
topic	Machine Learning
url	https://arxiv.org/abs/2510.19498
