Bibliographic Details
Main Authors: Luo, Yuchen, Zhu, Fangyue, Zhou, Ruining, Huang, Mingzhe, Zhu, Jian, Fan, Fanyu, Shao, Wei
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.17693
author Luo, Yuchen
Zhu, Fangyue
Zhou, Ruining
Huang, Mingzhe
Zhu, Jian
Fan, Fanyu
Shao, Wei
contents Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as the DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four algorithms (AWQ, GPTQ, SmoothQuant, and FlatQuant) that cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. By contrast, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU.
format Preprint
id arxiv_https___arxiv_org_abs_2602_17693
institution arXiv
publishDate 2026
record_format arxiv
title A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU
topic Machine Learning
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2602.17693