Staff View: Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Hao Mark; Mo, Zhiwen; Lee, Royson; Wang, Qianzhou; Li, Da; Hu, Shell Xu; Luk, Wayne; Hospedales, Timothy; Fan, Hongxiang
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.00879

_version_	1866914298843889664
author	Chen, Hao Mark; Mo, Zhiwen; Lee, Royson; Wang, Qianzhou; Li, Da; Hu, Shell Xu; Luk, Wayne; Hospedales, Timothy; Fan, Hongxiang
author_facet	Chen, Hao Mark; Mo, Zhiwen; Lee, Royson; Wang, Qianzhou; Li, Da; Hu, Shell Xu; Luk, Wayne; Hospedales, Timothy; Fan, Hongxiang
contents	Among parallel decoding paradigms, diffusion large language models (dLLMs) have emerged as a promising candidate that balances generation quality and throughput. However, their integration with Mixture-of-Experts (MoE) architectures is constrained by an expert explosion: as the number of tokens generated in parallel increases, the number of distinct experts activated grows nearly linearly. This results in substantial memory traffic that pushes inference into a memory-bound regime, negating the efficiency gains of both MoE and parallel decoding. To address this challenge, we propose Dynamic Expert Sharing (DES), a novel technique that shifts MoE optimization from token-centric pruning and conventional expert skipping methods to sequence-level coreset selection. To maximize expert reuse, DES identifies a compact, high-utility set of experts to satisfy the requirements of an entire parallel decoding block. We introduce two innovative selection strategies: (1) Intra-Sequence Sharing (DES-Seq), which adapts optimal allocation to the sequence level, and (2) Saliency-Aware Voting (DES-Vote), a novel mechanism that allows tokens to collectively elect a coreset based on aggregated router weights. Extensive experiments on MoE dLLMs demonstrate that DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy, effectively decoupling memory overhead from the degree of parallelism.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_00879
institution	arXiv
publishDate	2026
record_format	arxiv
title	Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs
topic	Machine Learning
url	https://arxiv.org/abs/2602.00879
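The contents field describes the expert explosion and Saliency-Aware Voting (DES-Vote) only at abstract level. The following is a minimal Python sketch of the general idea, not the authors' implementation, written under assumptions that are not in this record: a softmax router over a fixed pool of experts, independent top-k routing per token, a vote in which each token adds its top-k router weights to those experts' scores, a fixed coreset budget per decoding block, and gate weights renormalized inside the coreset. All function names and the parameters NUM_EXPERTS, TOP_K, BLOCK, and BUDGET are hypothetical, and the router logits are random synthetic values, not data from the paper. DES-Seq is not sketched.

```python
import math
import random
from typing import Dict, List, Sequence, Set


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(weights: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest weights; ties go to the lower index."""
    return sorted(range(len(weights)), key=lambda e: (-weights[e], e))[:k]


def unique_experts_vanilla(router_probs: List[List[float]], k: int) -> Set[int]:
    """Union of every token's top-k experts: the set a block must load
    when each token routes independently (the 'expert explosion')."""
    experts: Set[int] = set()
    for probs in router_probs:
        experts.update(top_k(probs, k))
    return experts


def vote_coreset(router_probs: List[List[float]], k: int, budget: int) -> List[int]:
    """Hypothetical saliency-weighted vote: each token adds its top-k router
    weights to those experts' scores, and the `budget` highest-scoring
    experts become the coreset shared by the whole block."""
    scores = [0.0] * len(router_probs[0])
    for probs in router_probs:
        for e in top_k(probs, k):
            scores[e] += probs[e]
    return sorted(top_k(scores, budget))


def route_within_coreset(probs: Sequence[float], coreset: Sequence[int], k: int) -> Dict[int, float]:
    """Route one token to its top-k experts inside the coreset only and
    renormalize the selected router weights so the gates sum to 1."""
    chosen = sorted(coreset, key=lambda e: (-probs[e], e))[:k]
    total = sum(probs[e] for e in chosen)
    return {e: probs[e] / total for e in chosen}


# Synthetic example: random router logits, not data from the paper.
random.seed(0)
NUM_EXPERTS, TOP_K, BLOCK, BUDGET = 64, 4, 32, 16
router_probs = [
    softmax([random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)])
    for _ in range(BLOCK)
]
vanilla = unique_experts_vanilla(router_probs, TOP_K)
coreset = vote_coreset(router_probs, TOP_K, BUDGET)
gates = [route_within_coreset(p, coreset, TOP_K) for p in router_probs]
loaded = set().union(*(g.keys() for g in gates))
print(f"experts loaded without sharing: {len(vanilla)}")
print(f"experts loaded with the shared coreset: {len(loaded)} (budget {BUDGET})")
```

With independent routing over a random router, the union of per-token top-k sets should cover most of the 64 experts for a 32-token block, while the shared coreset caps the experts loaded at BUDGET regardless of block size. Summing router weights rather than counting raw votes lets confidently routed tokens weigh more; this is one plausible reading of "aggregated router weights", not a confirmed detail of DES-Vote.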