Bibliographic Details
Main Authors: Mirvakhabova, Leyla, Bejnordi, Babak Ehteshami, Kumar, Gaurav, Liang, Hanxue, Zhao, Wanru, Whatmough, Paul
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2510.01185
author Mirvakhabova, Leyla
Bejnordi, Babak Ehteshami
Kumar, Gaurav
Liang, Hanxue
Zhao, Wanru
Whatmough, Paul
contents Upcycling pre-trained dense models into sparse Mixture-of-Experts (MoEs) efficiently increases model capacity but often suffers from poor expert specialization due to naive weight replication. Our analysis reveals that upcycled MoEs, even with conventional regularization, exhibit low-confidence, weakly differentiated routing, hindering performance. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior. DPSL offers fine-grained control over expert balance and specialization, and enables encoding of inductive biases such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention; notably, DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models (with Qwen2, Phi3, Llama3.2 LLM backbones) show DPSL consistently outperforms upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and fostering more adaptive, higher-performing models.
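The abstract describes DPSL only at a high level (shaping routing probabilities toward a target Dirichlet prior) and does not give the loss itself. The sketch below is therefore one plausible reading, not the authors' formulation: it scores each token's softmax routing distribution by its negative log-density under a Dirichlet with concentration `alpha`. The function names (`softmax`, `dirichlet_log_pdf`, `dirichlet_prior_penalty`), the `eps` clipping, and the per-token averaging are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dirichlet_log_pdf(probs, alpha, eps=1e-8):
    """Log-density of a Dirichlet(alpha) evaluated at a probability vector.

    log Dir(p | alpha) = lgamma(sum(alpha)) - sum(lgamma(alpha_i))
                         + sum((alpha_i - 1) * log(p_i))
    Probabilities are clipped at eps (an assumption) to keep the log finite.
    """
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum(
        (a - 1.0) * math.log(max(p, eps)) for p, a in zip(probs, alpha)
    )
    return log_norm + log_kernel


def dirichlet_prior_penalty(batch_logits, alpha):
    """Hypothetical regularizer: mean negative Dirichlet log-density of the
    per-token routing distributions. Not taken from the paper."""
    nll = [-dirichlet_log_pdf(softmax(row), alpha) for row in batch_logits]
    return sum(nll) / len(nll)


# Four experts; one confidently routed token and one uniformly routed token.
confident = [[4.0, 0.0, 0.0, 0.0]]
uniform = [[0.0, 0.0, 0.0, 0.0]]

sparse_alpha = [0.3] * 4  # alpha < 1 favors peaked (specialized) routing
dense_alpha = [5.0] * 4   # alpha > 1 favors balanced routing

print(dirichlet_prior_penalty(confident, sparse_alpha))
print(dirichlet_prior_penalty(uniform, sparse_alpha))
```

Under this reading, a symmetric `alpha` below 1 penalizes low-confidence routing, a symmetric `alpha` above 1 penalizes peaked routing, and an asymmetric `alpha` could encode a preference for particular experts. With `alpha` below 1 the density is unbounded at the corners of the simplex, so a practical version would need the clipping shown here or some other safeguard.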
format Preprint
id arxiv_https___arxiv_org_abs_2510_01185
institution arXiv
publishDate 2025
record_format arxiv
title Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
topic Machine Learning
url https://arxiv.org/abs/2510.01185