Staff View: A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU :: Library Catalog

Bibliographic Details
Main Authors:	Luo, Yuchen; Zhu, Fangyue; Zhou, Ruining; Huang, Mingzhe; Zhu, Jian; Fan, Fanyu; Shao, Wei
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2602.17693

_version_	1866912915014025216
author	Luo, Yuchen; Zhu, Fangyue; Zhou, Ruining; Huang, Mingzhe; Zhu, Jian; Fan, Fanyu; Shao, Wei
author_facet	Luo, Yuchen; Zhu, Fangyue; Zhou, Ruining; Huang, Mingzhe; Zhu, Jian; Fan, Fanyu; Shao, Wei
contents	Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms, including AWQ, GPTQ, SmoothQuant, and FlatQuant, to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_17693
institution	arXiv
publishDate	2026
record_format	arxiv
title	A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU
topic	Machine Learning; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2602.17693

Similar Items