Bibliographic Details
Main Authors: Chen, Hao Mark, Mo, Zhiwen, Lee, Royson, Wang, Qianzhou, Li, Da, Hu, Shell Xu, Luk, Wayne, Hospedales, Timothy, Fan, Hongxiang
Format: Preprint
Published: 2026
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2602.00879
author Chen, Hao Mark
Mo, Zhiwen
Lee, Royson
Wang, Qianzhou
Li, Da
Hu, Shell Xu
Luk, Wayne
Hospedales, Timothy
Fan, Hongxiang
contents Among parallel decoding paradigms, diffusion large language models (dLLMs) have emerged as a promising candidate that balances generation quality and throughput. However, their integration with Mixture-of-Experts (MoE) architectures is constrained by an expert explosion: as the number of tokens generated in parallel increases, the number of distinct experts activated grows nearly linearly. This results in substantial memory traffic that pushes inference into a memory-bound regime, negating the efficiency gains of both MoE and parallel decoding. To address this challenge, we propose Dynamic Expert Sharing (DES), a novel technique that shifts MoE optimization from token-centric pruning and conventional expert skipping methods to sequence-level coreset selection. To maximize expert reuse, DES identifies a compact, high-utility set of experts to satisfy the requirements of an entire parallel decoding block. We introduce two innovative selection strategies: (1) Intra-Sequence Sharing (DES-Seq), which adapts optimal allocation to the sequence level, and (2) Saliency-Aware Voting (DES-Vote), a novel mechanism that allows tokens to collectively elect a coreset based on aggregated router weights. Extensive experiments on MoE dLLMs demonstrate that DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy, effectively decoupling memory overhead from the degree of parallelism.
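The voting idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the random router weights, the block size, and the coreset size are all assumptions chosen for illustration, and the selection rule (sum each expert's router weight over the block, keep the highest-scoring experts, then let each token pick its top-k inside that shared set) is only one plausible reading of "aggregated router weights".

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(weights, k, allowed=None):
    """Indices of the k largest weights, optionally restricted to `allowed`."""
    candidates = range(len(weights)) if allowed is None else allowed
    return sorted(candidates, key=lambda e: weights[e], reverse=True)[:k]


def vanilla_routing(router_weights, k):
    """Baseline: every token independently picks its own top-k experts."""
    return [top_k_indices(w, k) for w in router_weights]


def vote_coreset(router_weights, coreset_size):
    """Hypothetical voting rule: sum router weights over all tokens in the
    block and keep the `coreset_size` experts with the largest totals."""
    num_experts = len(router_weights[0])
    votes = [sum(w[e] for w in router_weights) for e in range(num_experts)]
    return sorted(top_k_indices(votes, coreset_size))


def shared_routing(router_weights, k, coreset):
    """Each token picks its top-k experts, restricted to the shared coreset."""
    return [top_k_indices(w, k, allowed=coreset) for w in router_weights]


def unique_experts(assignments):
    """Number of distinct experts touched by the whole block."""
    return len({e for token in assignments for e in token})


# Illustrative sizes only; none of these values come from the paper.
NUM_TOKENS, NUM_EXPERTS, TOP_K, CORESET_SIZE = 16, 64, 4, 12

rng = random.Random(0)
router_weights = [
    softmax([rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)])
    for _ in range(NUM_TOKENS)
]

baseline = vanilla_routing(router_weights, TOP_K)
coreset = vote_coreset(router_weights, CORESET_SIZE)
shared = shared_routing(router_weights, TOP_K, coreset)

print("unique experts, per-token top-k:", unique_experts(baseline))
print("unique experts, shared coreset: ", unique_experts(shared))
```

With a shared coreset, the number of distinct experts loaded for the block is bounded by the coreset size regardless of how many tokens are decoded in parallel, which is the decoupling the abstract refers to. How the coreset size should be chosen, and how accuracy is preserved, is not covered by this sketch.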
format Preprint
id arxiv_https___arxiv_org_abs_2602_00879
institution arXiv
publishDate 2026
record_format arxiv
title Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs
topic Machine Learning
url https://arxiv.org/abs/2602.00879