Bibliographic Details
Main Authors: Hu, Rizhen; Cao, Yuan; Kong, Boao; Sun, Mou; Yuan, Kun
Format: Preprint
Published: 2026
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2602.14159
_version_ 1866910023201849344
author Hu, Rizhen
Cao, Yuan
Kong, Boao
Sun, Mou
Yuan, Kun
contents Sparse Mixture-of-Experts (MoE) models scale Transformers efficiently but suffer from expert overlap -- redundant representations across experts and routing ambiguity, resulting in severely underutilized model capacity. While architectural solutions like DeepSeekMoE promote specialization, they require substantial structural modifications and rely solely on intra-layer signals. In this paper, we propose two plug-and-play regularization losses that enhance MoE specialization and routing efficiency without modifying router or model architectures. First, an intra-layer specialization loss penalizes cosine similarity between experts' SwiGLU activations on identical tokens, encouraging experts to specialize in complementary knowledge. Second, a cross-layer coupling loss maximizes joint top-$k$ routing probabilities across adjacent layers, establishing coherent expert pathways through network depth while reinforcing intra-layer expert specialization. Both losses are orthogonal to the standard load-balancing loss and compatible with both the shared-expert architecture in DeepSeekMoE and vanilla top-$k$ MoE architectures. We implement both losses as a drop-in Megatron-LM module. Extensive experiments across pre-training, fine-tuning, and zero-shot benchmarks demonstrate consistent task gains, higher expert specialization, and lower-entropy routing; together, these improvements translate into faster inference via more stable expert pathways.
format Preprint
id arxiv_https___arxiv_org_abs_2602_14159
institution arXiv
publishDate 2026
record_format arxiv
title Synergistic Intra- and Cross-Layer Regularization Losses for MoE Expert Specialization
topic Machine Learning
url https://arxiv.org/abs/2602.14159
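
Illustrative sketch (not part of the catalog record): the abstract names two regularizers but gives no formulas, so the code below is only one plausible reading, written in plain Python for clarity. The function names, the choice of mean pairwise cosine similarity, and the form of the coupling term (negative log of the product of top-k routing mass in two adjacent layers) are all assumptions made for illustration; they are not taken from the paper or from its Megatron-LM module.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def intra_layer_loss(acts):
    """Assumed form of the intra-layer specialization penalty.

    acts[t][e] is the activation vector of expert e on token t.
    Returns the mean cosine similarity over all tokens and expert pairs,
    so experts that respond identically to a token are penalized most.
    """
    total, pairs = 0.0, 0
    for token_acts in acts:
        n = len(token_acts)
        for i in range(n):
            for j in range(i + 1, n):
                total += cosine(token_acts[i], token_acts[j])
                pairs += 1
    return total / pairs if pairs else 0.0


def topk_mass(probs, k):
    """Sum of the k largest routing probabilities."""
    return sum(sorted(probs, reverse=True)[:k])


def cross_layer_loss(probs_l, probs_next, k):
    """Assumed form of the cross-layer coupling term.

    probs_l[t] and probs_next[t] are the router distributions for token t
    in two adjacent layers. The loss is the negative mean log of the joint
    top-k mass, so minimizing it concentrates routing in both layers.
    """
    logs = [
        math.log(topk_mass(p, k) * topk_mass(q, k))
        for p, q in zip(probs_l, probs_next)
    ]
    return -sum(logs) / len(logs)
```

In a training loop, terms like these would presumably be added to the language-modeling and load-balancing losses with small weighting coefficients; the abstract does not state those coefficients, so none are suggested here.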