Staff View: Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training :: Library Catalog

Bibliographic Details
Main Author:	Mouzouni, Charafeddine
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Multiagent Systems; MSC: 91A13, 91A10, 68T07; ACM: I.2.6; G.3
Online Access:	https://arxiv.org/abs/2604.04230

_version_	1866918429830676480
author	Mouzouni, Charafeddine
author_facet	Mouzouni, Charafeddine
contents	We model Mixture-of-Experts (MoE) token routing as a congestion game with a single effective parameter, the congestion coefficient gamma_eff, that quantifies the balance-quality tradeoff. Tracking gamma_eff across training checkpoints of two open-source MoE models, OLMoE-1B-7B (20 checkpoints, with dense sampling in the surge region) and OpenMoE-8B (6 checkpoints), reveals a three-phase trajectory: a surge phase where the router learns to balance load (gamma_eff: 14 to 36-39, peaking in the step 30K-40K region), a stabilization phase where experts specialize under steady balance (B_0: 2.4 to 2.3, steps 100K-400K), and a relaxation phase where the router trades balance for quality as experts differentiate (gamma_eff: 27 to 9, steps 400K-1.2M). This non-monotone trajectory, invisible to post-hoc analysis of converged models, reveals that early MoE training prioritizes balance while late training prioritizes quality. The theoretical framework is honest about its limits: the single-type equilibrium reduces to temperature-scaled softmax (held-out L1: MFG = 0.199 vs. softmax = 0.200). The game is not a better predictor; it reveals what the temperature means and, critically, how that temperature evolves. We complement the dynamics with an effective congestion decomposition, a multi-type extension that improves load prediction via token clustering on all 16 layers (mean: 30%), scope diagnostics (K/M, epsilon_l), and robustness verification across four independent quality estimators (r >= 0.89). All confidence intervals are from bootstrap resampling over 50 independent text batches.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_04230
institution	arXiv
publishDate	2026
record_format	arxiv
title	Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training
topic	Machine Learning Artificial Intelligence Multiagent Systems 91A13, 91A10, 68T07 I.2.6; G.3
url	https://arxiv.org/abs/2604.04230

Similar Items