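Staff View: Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems :: Library Catalog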

Bibliographic Details
Main Authors:	Liu, Guowei; Li, Hongming; Guo, Yaning; Lyu, Yongxi; Zhou, Mo; Liu, Yi; Li, Zhaogeng; Wang, Yanpeng
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2602.09721

_version_	1866914319130689536
author	Liu, Guowei; Li, Hongming; Guo, Yaning; Lyu, Yongxi; Zhou, Mo; Liu, Yi; Li, Zhaogeng; Wang, Yanpeng
contents	Deploying large-scale Mixture-of-Experts (MoE) models presents challenges in memory capacity and in the bandwidth required for expert activation. While Attention-FFN Disaggregation (AFD) has emerged as a potential architecture for decoupling compute and memory resources, its performance boundaries relative to standard large-scale Expert Parallelism (EP) remain underexplored. In this paper, we conduct a systematic analysis of AFD by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). Our analysis reveals a dead zone on standard clusters: increasing the number of FFN instances fails to improve HFU because the computational workload is capped by scale-out bandwidth, causing operator active time to shrink relative to the fixed latency budget. We further show that AFD's discrete node-level scaling incurs higher imbalance penalties than EP's continuous batch adjustment. Nevertheless, these limitations diminish under specific conditions: Superpod-class hardware with abundant interconnect bandwidth, and models with coarse-grained experts and lower sparsity, are more likely to benefit from AFD. These findings position AFD as a promising approach for specific hardware-model combinations rather than a universal solution.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_09721
institution	arXiv
publishDate	2026
record_format	arxiv
title	Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems
topic	Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2602.09721

Similar Items