Bibliographic Details
Main Authors: Xiao, He, Yang, Qingyao, Xie, Dirui, Xu, Wendong, Su, Zunhai, Yang, Runming, Zhou, Wenyong, Liu, Haobo, Liu, Zhengwu, Wong, Ngai
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2508.03332
_version_ 1866914221563838464
author Xiao, He
Yang, Qingyao
Xie, Dirui
Xu, Wendong
Su, Zunhai
Yang, Runming
Zhou, Wenyong
Liu, Haobo
Liu, Zhengwu
Wong, Ngai
contents Large language models with billions of parameters are often over-provisioned: many layers contribute little unique information yet dominate the memory and energy footprint during inference. We present LieQ (Layer-wise information effectiveness Quantization), a hardware-native, metric-driven post-training quantization framework that addresses the critical challenge of maintaining accuracy in sub-8B models (those with fewer than 8B parameters) under extreme low-bit compression. LieQ keeps a uniform bit-width within each layer while mixing precision across layers, preserving standard multiplication kernels and avoiding irregular memory access, codebooks, or irregular formats at inference time. Our method uncovers a strong correlation between layer-wise functional saliency and representational compactness, revealing that layers with higher training-induced energy concentration are functionally irreplaceable. Leveraging this insight, we propose a purely geometry-driven sensitivity proxy that enables automatic bit-width allocation under a target average-bit budget without expensive gradient updates or inference-based perplexity probing. At sub-2-bit precision, LieQ consistently reduces the large accuracy gap typically observed for naive 2-bit baselines on the Qwen3 and LLaMA3.x families, while retaining standard-kernel efficiency. These properties make LieQ a practical path toward deploying small language models on resource-constrained edge devices. Code will be available at: https://github.com/HeXiao-55/LieQ-official.git.
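Illustrative note: the abstract describes assigning one bit-width per layer so that the average stays within a target budget, but it does not spell out the allocation rule. The sketch below shows one simple way such a budgeted allocation could work, using a greedy pass from the most to the least sensitive layer. The function name `allocate_bits`, the example scores, and the candidate bit-widths (1, 2, 4) are assumptions made for illustration only; this is not LieQ's actual sensitivity proxy or allocation procedure.

```python
from typing import List, Sequence


def allocate_bits(
    scores: Sequence[float],
    sizes: Sequence[int],
    target_avg_bits: float,
    choices: Sequence[int] = (1, 2, 4),  # hypothetical candidate bit-widths
) -> List[int]:
    """Greedy per-layer bit allocation under an average-bit budget.

    scores: hypothetical per-layer sensitivity values (higher = more sensitive).
    sizes:  number of weights in each layer.
    Returns one bit-width per layer (uniform within the layer).
    """
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")

    levels = sorted(choices)
    total_weights = sum(sizes)
    budget = target_avg_bits * total_weights  # total bits we may spend

    # Start every layer at the lowest candidate bit-width.
    bits = [levels[0]] * len(scores)
    used = levels[0] * total_weights
    if used > budget:
        raise ValueError("budget is below the lowest candidate bit-width")

    # Visit layers from most to least sensitive and raise each one
    # as far as the remaining budget allows.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for i in order:
        for b in levels[1:]:
            extra = (b - bits[i]) * sizes[i]
            if used + extra > budget:
                break
            bits[i] = b
            used += extra
    return bits


if __name__ == "__main__":
    layer_scores = [0.9, 0.1, 0.5, 0.2]   # made-up sensitivity values
    layer_sizes = [100, 100, 100, 100]    # made-up weight counts
    plan = allocate_bits(layer_scores, layer_sizes, target_avg_bits=2.0)
    avg = sum(b * n for b, n in zip(plan, layer_sizes)) / sum(layer_sizes)
    print(plan, avg)
```

Because each layer receives a single bit-width, a plan like this keeps the per-layer weight format regular, which is the property the abstract associates with standard kernels. A greedy pass is only one option; a knapsack-style or search-based allocator could be swapped in under the same budget constraint.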
format Preprint
id arxiv_https___arxiv_org_abs_2508_03332
institution arXiv
publishDate 2025
record_format arxiv
title Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2508.03332