Bibliographic Details
Main Authors: Liu, Guowei, Li, Hongming, Guo, Yaning, Lyu, Yongxi, Zhou, Mo, Liu, Yi, Li, Zhaogeng, Wang, Yanpeng
Format: Preprint
Published: 2026
Subjects: Distributed, Parallel, and Cluster Computing
Online Access: https://arxiv.org/abs/2602.09721
_version_ 1866914319130689536
author Liu, Guowei
Li, Hongming
Guo, Yaning
Lyu, Yongxi
Zhou, Mo
Liu, Yi
Li, Zhaogeng
Wang, Yanpeng
contents Deploying large-scale MoE models presents challenges in memory capacity and bandwidth for expert activation. While Attention-FFN Disaggregation (AFD) has emerged as a potential architecture to decouple compute and memory resources, its performance boundaries compared to standard large-scale Expert Parallelism (EP) remain underexplored. In this paper, we conduct a systematic analysis of AFD by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). Our analysis reveals a dead zone on standard clusters: increasing FFN instance count fails to improve HFU as computational workload is capped by scale-out bandwidth, causing operator active time to shrink relative to the fixed latency budget. We further show that AFD's discrete node-level scaling incurs higher imbalance penalties than EP's continuous batch adjustment. Nevertheless, these limitations diminish under specific conditions: Superpod-class hardware with abundant interconnect bandwidth and models with coarse-grained experts and lower sparsity are more likely to benefit from AFD. These findings position AFD as a promising approach for specific hardware-model combinations rather than a universal solution.
format Preprint
id arxiv_https___arxiv_org_abs_2602_09721
institution arXiv
publishDate 2026
record_format arxiv
title Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems
topic Distributed, Parallel, and Cluster Computing
url https://arxiv.org/abs/2602.09721