Staff View: Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Yung-Chin; Lee, Chung Peng; Liou, Ze-Wei; Verma, Naveen
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2606.02288

_version_	1866916074768826368
author	Chen, Yung-Chin; Lee, Chung Peng; Liou, Ze-Wei; Verma, Naveen
author_facet	Chen, Yung-Chin; Lee, Chung Peng; Liou, Ze-Wei; Verma, Naveen
contents	Massive activation spikes in Large Language Models (LLMs) severely degrade quantization by stretching dynamic ranges. While prior hypotheses characterize these as high-level scalar biases, we argue that they are merely the scalar intermediates of rigid, structural vector biases in the spike-carrying tokens. We show that these tokens converge to constant vectors after normalization that drive the attention sink and value-state drain mechanisms. We geometrically substantiate this by analyzing the coordination of projection weights: $W_K$ contrastively amplifies the vector, $W_Q$ aligns semantic tokens toward it, and $W_V$ projects it into the spectral null-space. Furthermore, we reveal that the model actively preserves these structural biases against Rotary Positional Embedding (RoPE) perturbations by localizing them in "zones of rotational stability" utilizing low-frequency bands and coherent channel pairs. Leveraging this, we propose INSERTQUANT, a post-training quantization (PTQ) framework that clamps spikes and restores their function via pre-computed template vectors. This renders activations strictly spike-free, enabling robust low-bit quantization with high fidelity. INSERTQUANT achieves parity with state-of-the-art per-tensor quantization methods on LLMs and uniquely generalizes beyond text to other modalities such as ViTs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2606_02288
institution	arXiv
publishDate	2026
record_format	arxiv
title	Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization
topic	Machine Learning
url	https://arxiv.org/abs/2606.02288
