Bibliographic Details
Main Author: Jin, Haopeng
Format: Digital resource
Language: English
Published: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.19712473
Table of Contents:
  • <p><strong>CARL-MoE</strong> technical report / preprint.</p><p>Sparse Mixture-of-Experts (MoE) models increase parameter capacity without proportional per-token computation by activating only a subset of experts for each token [Shazeer et al., 2017; Fedus et al., 2022; Du et al., 2022]. In practice, however, training efficiency is often limited by three coupled issues: topology-oblivious routing, uneven expert utilization, and expensive expert-parallel communication [Lepikhin et al., 2021; Rajbhandari et al., 2022; Gale et al., 2023]. Prior work has frequently improved routing, balancing, or distributed execution in isolation. This separation can create a mismatch between token-expert affinity and the actual cost of dispatching tokens across a heterogeneous cluster. We present a unified framework for efficient MoE training that combines: (i) communication-aware routing, which adjusts router utilities using estimated dispatch cost; (ii) adaptive dual-level load balancing, which regularizes both expert-level and group-level load and adjusts balancing strength based on observed skew; and (iii) communication-aware expert parallelism, including locality-biased hierarchical routing, a short Sinkhorn-based warm start, and periodic expert placement refresh using accumulated routing statistics. The contribution is primarily integrative rather than a claim of first invention of any single mechanism. We formulate the method precisely, analyze its computational trade-offs, and report simulation-based experiments with exact computed values under a transparent communication model. Across the studied settings, the integrated method reduces simulated communication cost and load skew relative to topology-oblivious baselines while preserving routing selectivity. 
These results support the broader systems-ML thesis that MoE routing should be co-designed with cluster topology rather than optimized independently.</p><p>Existing OSF archival DOI: 10.17605/OSF.IO/3MF56; Existing OSF archival page: https://osf.io/3mf56/.</p><p>Files include the technical report PDF and the LaTeX source tarball when available.</p>
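
The abstract describes communication-aware routing as adjusting router utilities using estimated dispatch cost. This record does not give the report's exact formulation, so the following is only a minimal sketch under stated assumptions: the utility is taken to be the router logit minus a weighted cost penalty, and the names `route_top_k`, `dispatch_cost`, and `lam` are hypothetical, chosen for illustration.

```python
# Hypothetical sketch: the function name, the linear penalty form, and the
# weight `lam` are illustrative assumptions, not taken from the CARL-MoE report.
from typing import List


def route_top_k(logits: List[float], dispatch_cost: List[float],
                lam: float, k: int) -> List[int]:
    """Return indices of the k experts with the highest cost-adjusted utility.

    Assumed utility: utility[e] = logits[e] - lam * dispatch_cost[e]
    """
    if len(logits) != len(dispatch_cost):
        raise ValueError("logits and dispatch_cost must have equal length")
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    utility = [a - lam * c for a, c in zip(logits, dispatch_cost)]
    # Sort by descending utility; ties are broken by lower expert index.
    order = sorted(range(len(utility)), key=lambda e: (-utility[e], e))
    return order[:k]


# One token, four experts. Experts 0 and 1 are assumed to sit on the local
# node (cost 0); experts 2 and 3 on a remote node (cost 1).
logits = [1.0, 0.2, 1.1, 0.9]
cost = [0.0, 0.0, 1.0, 1.0]

oblivious = route_top_k(logits, cost, lam=0.0, k=2)  # ignores topology
aware = route_top_k(logits, cost, lam=1.0, k=2)      # penalizes remote dispatch

print("topology-oblivious:", oblivious)
print("communication-aware:", aware)
```

With `lam=0.0` the router picks purely by affinity and sends the token to remote expert 2. With `lam=1.0` the remote penalty outweighs the small affinity gap and both selected experts are local. In this toy setting `lam` controls how much affinity is traded for lower dispatch cost.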