| Main Authors: | Liu, Guowei; Li, Hongming; Guo, Yaning; Lyu, Yongxi; Zhou, Mo; Liu, Yi; Li, Zhaogeng; Wang, Yanpeng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Distributed, Parallel, and Cluster Computing |
| Online Access: | https://arxiv.org/abs/2602.09721 |
| _version_ | 1866914319130689536 |
|---|---|
| author | Liu, Guowei; Li, Hongming; Guo, Yaning; Lyu, Yongxi; Zhou, Mo; Liu, Yi; Li, Zhaogeng; Wang, Yanpeng |
| author_facet | Liu, Guowei; Li, Hongming; Guo, Yaning; Lyu, Yongxi; Zhou, Mo; Liu, Yi; Li, Zhaogeng; Wang, Yanpeng |
| contents | Deploying large-scale MoE models presents challenges in memory capacity and bandwidth for expert activation. While Attention-FFN Disaggregation (AFD) has emerged as a potential architecture to decouple compute and memory resources, its performance boundaries compared to standard large-scale Expert Parallelism (EP) remain underexplored. In this paper, we conduct a systematic analysis of AFD by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). Our analysis reveals a dead zone on standard clusters: increasing FFN instance count fails to improve HFU as computational workload is capped by scale-out bandwidth, causing operator active time to shrink relative to the fixed latency budget. We further show that AFD's discrete node-level scaling incurs higher imbalance penalties than EP's continuous batch adjustment. Nevertheless, these limitations diminish under specific conditions: Superpod-class hardware with abundant interconnect bandwidth and models with coarse-grained experts and lower sparsity are more likely to benefit from AFD. These findings position AFD as a promising approach for specific hardware-model combinations rather than a universal solution. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_09721 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems |
| topic | Distributed, Parallel, and Cluster Computing |
| url | https://arxiv.org/abs/2602.09721 |