
Bibliographic Details
Main Authors:	Gupta, Aman, Celente, Rafael, Shivanna, Abhishek, Braithwaite, D. T., Dexter, Gregory, Tang, Shao, Udagawa, Hiroto, Silva, Daniel, Ramanath, Rohan, Keerthi, S. Sathiya
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2509.23106

Table of Contents:

The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and better computational efficiency than AdamW in LLM pre-training. However, the memory overhead of maintaining high-precision optimizer states remains a challenge for large-scale deployment. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization. In extensive Chinchilla-optimal experiments on pre-training models of up to 2.7B parameters and fine-tuning them for instruction following, we demonstrate that 8-bit Muon achieves parity with Muon in terms of validation loss and downstream benchmarks, while reducing the optimizer state footprint by up to 62%. Crucially, we show that Muon's update mechanism is uniquely compatible with a simple linear quantization scheme, bypassing the complex dynamic scaling required for quantized AdamW. We supplement our empirical findings with a theoretical analysis of Muon's robustness to quantization noise.
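To make the phrase "blockwise quantization with a simple linear scheme" concrete, here is a minimal NumPy sketch of one way such a scheme could look. It is an illustration only, not the paper's implementation: the function names, the block size of 256, the per-block absolute-maximum scaling, and the symmetric int8 range are all assumptions that the abstract does not specify.

```python
import numpy as np

def quantize_blockwise(x, block_size=256):
    """Linear 8-bit quantization applied independently to each block.

    Hypothetical helper: block_size and absmax scaling are assumptions.
    """
    flat = np.asarray(x, dtype=np.float32).ravel()
    # Zero-pad so the flattened tensor splits evenly into blocks.
    pad = (-flat.size) % block_size
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    # One float32 scale per block: the largest magnitude in that block.
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    # Map [-scale, scale] linearly onto the integer range [-127, 127].
    codes = np.round(blocks / scales * 127).astype(np.int8)
    return codes, scales

def dequantize_blockwise(codes, scales, shape):
    """Invert quantize_blockwise, dropping the padding and restoring shape."""
    blocks = codes.astype(np.float32) / 127 * scales
    size = int(np.prod(shape))
    return blocks.ravel()[:size].reshape(shape)

# Example: round-trip a stand-in for an optimizer-state tensor.
rng = np.random.default_rng(0)
state = rng.standard_normal((64, 48)).astype(np.float32)
codes, scales = quantize_blockwise(state)
restored = dequantize_blockwise(codes, scales, state.shape)
```

In this sketch each element is stored as one int8 code plus a shared float32 scale per block, and the round-trip error of any element is at most half a quantization step of its block (scale / 254). Because each block has its own scale, an outlier only degrades the resolution of the block that contains it.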
