Bibliographic Details
Main Authors: Zhang, Qijun, Zhang, Chen, Zhou, Zhuoshan, Wang, Haibo, Zhou, Zhe, Tu, Zhipeng, Sun, Guangyu, Xie, Zhiyao, Diao, Yijia, Ji, Zhigang, Leng, Jingwen, He, Guanghui, Guo, Minyi
Format: Preprint
Published: 2026
Subjects: Hardware Architecture; Distributed, Parallel, and Cluster Computing
Online Access: https://arxiv.org/abs/2605.05607
author Zhang, Qijun
Zhang, Chen
Zhou, Zhuoshan
Wang, Haibo
Zhou, Zhe
Tu, Zhipeng
Sun, Guangyu
Xie, Zhiyao
Diao, Yijia
Ji, Zhigang
Leng, Jingwen
He, Guanghui
Guo, Minyi
contents Mixture-of-Experts (MoE) has been adopted by many leading large models to reduce computational requirements. However, frequent inter-GPU communication in MoE expert parallelism (EP) has become a performance challenge. We observe substantial redundant inter-GPU data transfers in MoE that could potentially be addressed by in-switch computing. Unfortunately, the existing solution, NVLink SHARP (NVLS), supports only static collectives with regular patterns and cannot handle the dynamic, irregular communication in MoE. To bridge this functionality gap, we propose DySHARP, an integrated dynamic in-switch computing solution for accelerating MoE that encompasses both communication primitives and communication-aware scheduling: 1) Dynamic multimem addressing co-designs the ISA, architecture, and runtime as a dynamic extension to NVLS, reducing redundant traffic. However, the resulting traffic reduction is inherently asymmetric between the two directions, which prevents it from translating directly into speedup. 2) Token-centric kernel fusion deeply fuses the dispatch-computation-combine pipeline, resolving this asymmetry so that the traffic reduction translates into actual speedup. Compared with the state-of-the-art solution, DySHARP achieves up to 1.79× speedup.
format Preprint
id arxiv_https___arxiv_org_abs_2605_05607
institution arXiv
publishDate 2026
record_format arxiv
title Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
topic Hardware Architecture
Distributed, Parallel, and Cluster Computing
url https://arxiv.org/abs/2605.05607