Staff View: Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Qijun; Zhang, Chen; Zhou, Zhuoshan; Wang, Haibo; Zhou, Zhe; Tu, Zhipeng; Sun, Guangyu; Xie, Zhiyao; Diao, Yijia; Ji, Zhigang; Leng, Jingwen; He, Guanghui; Guo, Minyi
Format:	Preprint
Published:	2026
Subjects:	Hardware Architecture; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2605.05607

_version_	1866917466804846592
author	Zhang, Qijun; Zhang, Chen; Zhou, Zhuoshan; Wang, Haibo; Zhou, Zhe; Tu, Zhipeng; Sun, Guangyu; Xie, Zhiyao; Diao, Yijia; Ji, Zhigang; Leng, Jingwen; He, Guanghui; Guo, Minyi
author_facet	Zhang, Qijun; Zhang, Chen; Zhou, Zhuoshan; Wang, Haibo; Zhou, Zhe; Tu, Zhipeng; Sun, Guangyu; Xie, Zhiyao; Diao, Yijia; Ji, Zhigang; Leng, Jingwen; He, Guanghui; Guo, Minyi
contents	Mixture-of-Experts (MoE) has been adopted by many leading large models to reduce computational requirements. However, frequent inter-GPU communication in MoE expert parallelism (EP) becomes a performance challenge. We observe substantial redundant inter-GPU data transfers in MoE that can potentially be addressed by in-switch computing. Unfortunately, the existing solution, NVLink SHARP (NVLS), supports only static collectives with regular patterns and cannot handle the dynamic, irregular communication patterns in MoE. To bridge this functionality gap, we propose DySHARP, an integral dynamic in-switch computing solution to accelerate MoE, encompassing both communication primitives and communication-aware scheduling: 1) Dynamic multimem addressing co-designs the ISA, architecture, and runtime as a dynamic extension to NVLS, reducing redundant traffic. However, the resulting traffic reduction is inherently asymmetric between the two directions, which prevents it from translating directly into speedup. 2) Token-centric kernel fusion deeply fuses the dispatch-computation-combine pipeline, resolving this asymmetry so that the traffic reduction translates into actual speedup. Compared with the state-of-the-art solution, DySHARP achieves up to 1.79× speedup.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_05607
institution	arXiv
publishDate	2026
record_format	arxiv
title	Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
topic	Hardware Architecture; Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2605.05607