| Main Authors: | Li, Xinlin; Chou, Timothy; Fromm, Josh; Liu, Zichang; Pan, Yunjie; Fragouli, Christina |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2602.17698 |
| _version_ | 1866911458618507264 |
|---|---|
| author | Li, Xinlin; Chou, Timothy; Fromm, Josh; Liu, Zichang; Pan, Yunjie; Fragouli, Christina |
| author_facet | Li, Xinlin; Chou, Timothy; Fromm, Josh; Liu, Zichang; Pan, Yunjie; Fragouli, Christina |
| contents | Post-training weight quantization is crucial for reducing the memory and inference cost of large language models (LLMs), yet pushing the average precision below 4 bits remains challenging due to highly non-uniform weight sensitivity and the lack of principled precision allocation. Existing solutions either use irregular, fine-grained mixed precision, which incurs high runtime overhead, or rely on heuristics or highly constrained precision-allocation strategies. In this work, we propose ScaleBITS, a mixed-precision quantization framework that enables automated, fine-grained bitwidth allocation under a memory budget while preserving hardware efficiency. Guided by a new sensitivity analysis, we introduce a hardware-aligned, block-wise weight partitioning scheme, powered by bi-directional channel reordering. We formulate global bitwidth allocation as a constrained optimization problem and develop a scalable approximation to the greedy algorithm, enabling end-to-end principled allocation. Experiments show that ScaleBITS significantly improves over uniform-precision quantization (up to +36%) and outperforms state-of-the-art sensitivity-aware baselines (up to +13%) in the ultra-low-bit regime, without adding runtime overhead. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_17698 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLMs |
| topic | Machine Learning Artificial Intelligence |
| url | https://arxiv.org/abs/2602.17698 |
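
The abstract describes bitwidth allocation as a constrained optimization problem solved by an approximation to a greedy algorithm. As a rough illustration only, the sketch below shows a plain greedy allocator under a memory budget. It is not the ScaleBITS method or its scalable approximation. The function name `greedy_bit_allocation`, the error model `sensitivity * 4**-bits`, and the 2 to 8 bit range are all assumptions made for this example.

```python
import heapq


def greedy_bit_allocation(block_sizes, error, budget_bits, min_bits=2, max_bits=8):
    """Plain greedy allocation (illustrative, not the paper's algorithm).

    block_sizes[i] : number of weights in block i
    error[i][b]    : assumed quantization error of block i at b bits
    budget_bits    : total memory budget, in bits
    Every block starts at min_bits; each step gives one more bit to the
    block with the largest error reduction per extra bit of memory.
    """
    bits = [min_bits] * len(block_sizes)
    used = sum(n * min_bits for n in block_sizes)
    if used > budget_bits:
        raise ValueError("budget is too small for min_bits in every block")

    heap = []  # max-heap via negated gain: (-gain_per_memory_bit, block index)

    def push(i):
        b = bits[i]
        if b < max_bits:
            gain = (error[i][b] - error[i][b + 1]) / block_sizes[i]
            heapq.heappush(heap, (-gain, i))

    for i in range(len(block_sizes)):
        push(i)

    while heap:
        _, i = heapq.heappop(heap)
        cost = block_sizes[i]  # one extra bit for every weight in the block
        if used + cost > budget_bits:
            continue  # this upgrade does not fit; try the remaining blocks
        bits[i] += 1
        used += cost
        push(i)
    return bits, used


# Hypothetical inputs: four equal-sized blocks with made-up sensitivities.
sizes = [128, 128, 128, 128]
sensitivities = [100.0, 1.0, 10.0, 0.1]
errors = [{b: s * 4.0 ** -b for b in range(2, 9)} for s in sensitivities]
allocation, used_bits = greedy_bit_allocation(sizes, errors, budget_bits=3 * sum(sizes))
print(allocation, used_bits / sum(sizes))
```

With a budget of 3 bits per weight on average, the more sensitive blocks end up with more bits and the least sensitive ones stay at the minimum. A real system would replace the toy error model with measured sensitivities.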