Staff View: Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization :: Library Catalog

Bibliographic Details
Main Authors:	Hu, Rizhen; Cao, Yuan; Kong, Boao; Sun, Mou; Yuan, Kun
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.14159

_version_	1866910023201849344
author	Hu, Rizhen; Cao, Yuan; Kong, Boao; Sun, Mou; Yuan, Kun
author_facet	Hu, Rizhen; Cao, Yuan; Kong, Boao; Sun, Mou; Yuan, Kun
contents	Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap (redundant representations across experts and routing ambiguity), which leaves model capacity severely underutilized. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint top-$k$ routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer expert specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert architecture in DeepSeekMoE and vanilla top-$k$ MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training, fine-tuning, and zero-shot benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_14159
institution	arXiv
publishDate	2026
record_format	arxiv
title	Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization
topic	Machine Learning
url	https://arxiv.org/abs/2602.14159
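
Illustrative note: the two losses described in the contents field can be sketched roughly as follows. This is a minimal pure-Python sketch under stated assumptions, not the authors' Megatron-LM module. The function names (`intra_layer_loss`, `cross_layer_loss`), the averaging scheme, and the specific form of the "joint top-k probability" (product of per-layer top-k routing mass) are all hypothetical choices made for illustration; the record does not specify the exact formulas.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def intra_layer_loss(expert_acts):
    """Mean pairwise cosine similarity between expert activations per token.

    expert_acts: one entry per token; each entry is a list of activation
    vectors, one per expert that processed that token (assumed layout).
    Lower values mean the experts respond more differently to the same token.
    """
    total, count = 0.0, 0
    for acts in expert_acts:
        for u, v in combinations(acts, 2):
            total += cosine(u, v)
            count += 1
    return total / count if count else 0.0


def top_k_indices(probs, k):
    """Indices of the k largest routing probabilities."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def cross_layer_loss(probs_layer, probs_next, k):
    """Negative mean joint top-k routing probability for adjacent layers.

    Assumed form: for each token, sum p[i] * q[j] over the top-k experts i
    of one layer and the top-k experts j of the next. Minimizing the
    negative value maximizes the joint probability.
    """
    total = 0.0
    for p, q in zip(probs_layer, probs_next):
        top_p = top_k_indices(p, k)
        top_q = top_k_indices(q, k)
        total += sum(p[i] * q[j] for i in top_p for j in top_q)
    return -total / len(probs_layer)


if __name__ == "__main__":
    # One token, two experts with orthogonal activations: no overlap penalty.
    print(intra_layer_loss([[[1.0, 0.0], [0.0, 1.0]]]))
    # One token, router probabilities over three experts in two layers.
    print(cross_layer_loss([[0.7, 0.2, 0.1]], [[0.6, 0.3, 0.1]], k=1))
```

In a training setup, both terms would presumably be added to the task loss with small weights alongside the standard load-balancing loss; the weights and any normalization are not given in this record.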

Similar Items