

Bibliographic Details
Main Author:	Jin, Haopeng
Format:	Digital resource
Language:	English
Published:	Zenodo 2026
Subjects:	CARL-MoE; Communication-Aware Affinity Routing; Dual-Level Adaptive Load Balancing; Locality-Biased Hierarchical Expert Parallelism; Congestion-Aware Sinkhorn Warm Start; Topology-Aware Expert Placement Refresh; GShard; Switch Transformer; GLaM; BASE Layers; machine learning; deep learning
Online Access:	https://doi.org/10.5281/zenodo.19712473

Table of Contents:

CARL-MoE technical report / preprint.

Sparse Mixture-of-Experts (MoE) models increase parameter capacity without proportional per-token computation by activating only a subset of experts for each token [Shazeer et al., 2017; Fedus et al., 2022; Du et al., 2022]. In practice, however, training efficiency is often limited by three coupled issues: topology-oblivious routing, uneven expert utilization, and expensive expert-parallel communication [Lepikhin et al., 2021; Rajbhandari et al., 2022; Gale et al., 2023]. Prior work has frequently improved routing, balancing, or distributed execution in isolation. This separation can create a mismatch between token-expert affinity and the actual cost of dispatching tokens across a heterogeneous cluster. We present a unified framework for efficient MoE training that combines: (i) communication-aware routing, which adjusts router utilities using estimated dispatch cost; (ii) adaptive dual-level load balancing, which regularizes both expert-level and group-level load and adjusts balancing strength based on observed skew; and (iii) communication-aware expert parallelism, including locality-biased hierarchical routing, a short Sinkhorn-based warm start, and periodic expert placement refresh using accumulated routing statistics. The contribution is primarily integrative rather than a claim of first invention of any single mechanism. We formulate the method precisely, analyze its computational trade-offs, and report simulation-based experiments with exact computed values under a transparent communication model. Across the studied settings, the integrated method reduces simulated communication cost and load skew relative to topology-oblivious baselines while preserving routing selectivity. These results support the broader systems-ML thesis that MoE routing should be co-designed with cluster topology rather than optimized independently.

Existing OSF archival DOI: 10.17605/OSF.IO/3MF56; Existing OSF archival page: https://osf.io/3mf56/. Files include the technical report PDF and the LaTeX source tarball when available.
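
As a rough illustration of item (i) in the abstract, communication-aware routing, the sketch below subtracts a scaled dispatch-cost estimate from each router score before top-k selection. It is not taken from the report: the function name `route_token`, the penalty weight `lam`, the linear form of the penalty, and the choice to compute gate weights from the adjusted scores are all assumptions made for illustration.

```python
import math


def route_token(logits, dispatch_cost, lam=0.5, k=2):
    """Select k experts by affinity minus a scaled dispatch-cost penalty.

    logits        -- router scores for one token, one entry per expert
    dispatch_cost -- hypothetical estimated cost of sending the token to
                     each expert (for example, higher for remote nodes)
    lam           -- assumed penalty weight; lam=0 gives plain top-k routing
    Returns a list of (expert_index, gate_weight) pairs.
    """
    if len(logits) != len(dispatch_cost):
        raise ValueError("logits and dispatch_cost must have equal length")
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Adjusted utility: token-expert affinity minus the scaled cost estimate.
    adjusted = [s - lam * c for s, c in zip(logits, dispatch_cost)]

    # Indices of the k experts with the highest adjusted utility.
    order = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
    chosen = order[:k]

    # Softmax over the chosen experts only (assumption: gates use adjusted scores).
    peak = max(adjusted[i] for i in chosen)
    exps = [math.exp(adjusted[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Made-up example: expert 0 has the best raw score but is expensive to reach.
example_logits = [2.0, 1.9, 0.5, 0.1]
example_cost = [4.0, 0.0, 0.0, 1.0]

oblivious = route_token(example_logits, example_cost, lam=0.0)  # picks experts 0 and 1
aware = route_token(example_logits, example_cost, lam=0.5)      # picks experts 1 and 2
```

With `lam=0.0` the router ignores topology and selects the two highest-affinity experts. With `lam=0.5` the costly expert 0 drops out in favor of a cheaper one, which is the kind of trade-off between affinity and dispatch cost that the abstract describes.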
