| Main Authors: | Wang, Fengjuan; Su, Zhiyi; Hu, Xingzhu; Wang, Cheng; Sun, Mou |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2511.02302 |
| _version_ | 1866911248068640768 |
|---|---|
| author | Wang, Fengjuan; Su, Zhiyi; Hu, Xingzhu; Wang, Cheng; Sun, Mou |
| contents | Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8's theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability.
We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent, FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and reduce the number of explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and naïve FP8 baselines, while maintaining stable convergence. We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced soon. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2511_02302 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error |
| topic | Machine Learning; Artificial Intelligence |
| url | https://arxiv.org/abs/2511.02302 |
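
The abstract attributes double quantization error to tensors being quantized along different dimensions with inconsistent scaling factors. The sketch below is one way to see that effect numerically. It is not the authors' implementation. It assumes per-row and per-column absmax scaling (the abstract does not state the scaling granularity) and uses a simplified software emulation of E4M3 rounding in NumPy. The helper names `round_e4m3`, `quantize`, and `dequantize` are hypothetical.

```python
import numpy as np

FP8_MAX = 448.0  # largest finite value in the E4M3 format


def round_e4m3(x):
    """Round to the nearest E4M3-representable value (simplified emulation,
    no NaN handling). Assumes 3 mantissa bits and a minimum normal exponent of -6."""
    x = np.clip(x, -FP8_MAX, FP8_MAX)
    _, exp = np.frexp(x)             # x = m * 2**exp with 0.5 <= |m| < 1
    e = np.maximum(exp - 1, -6)      # floor(log2|x|), clamped to the subnormal range
    quantum = np.exp2(e - 3.0)       # spacing between neighbouring values
    return np.round(x / quantum) * quantum


def quantize(x, axis):
    """Absmax-scale along `axis`, then round to the emulated FP8 grid."""
    amax = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(amax > 0, amax / FP8_MAX, 1.0)
    return round_e4m3(x / scale), scale


def dequantize(q, scale):
    return q * scale


rng = np.random.default_rng(0)
# Rows with very different magnitudes, so row and column scales disagree.
x = rng.standard_normal((64, 64)) * np.exp(rng.uniform(-3, 3, size=(64, 1)))

# Step 1: row-wise quantization (one scale per row).
q_row, s_row = quantize(x, axis=1)
x_row = dequantize(q_row, s_row)

# Reference: column-wise quantization taken directly from the original tensor.
q_col, s_col = quantize(x, axis=0)
err_single = np.abs(dequantize(q_col, s_col) - x).mean()

# Double quantization: column-wise quantization of the already-quantized tensor.
q_col2, s_col2 = quantize(x_row, axis=0)
err_double = np.abs(dequantize(q_col2, s_col2) - x).mean()

print(f"single quantization error: {err_single:.6f}")
print(f"double quantization error: {err_double:.6f}")
```

In this toy setting the second rounding happens on a grid defined by the column scales, which generally does not line up with the grid defined by the row scales, so the two rounding errors add up instead of coinciding.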