| Main Authors: | Luo, Yuchen; Zhu, Fangyue; Zhou, Ruining; Huang, Mingzhe; Zhu, Jian; Fan, Fanyu; Shao, Wei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning; Artificial Intelligence; Computation and Language |
| Online Access: | https://arxiv.org/abs/2602.17693 |
| _version_ | 1866912915014025216 |
|---|---|
| author | Luo, Yuchen; Zhu, Fangyue; Zhou, Ruining; Huang, Mingzhe; Zhu, Jian; Fan, Fanyu; Shao, Wei |
| author_facet | Luo, Yuchen; Zhu, Fangyue; Zhou, Ruining; Huang, Mingzhe; Zhu, Jian; Fan, Fanyu; Shao, Wei |
| contents | Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms, including AWQ, GPTQ, SmoothQuant, and FlatQuant, to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_17693 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU |
| topic | Machine Learning; Artificial Intelligence; Computation and Language |
| url | https://arxiv.org/abs/2602.17693 |