| Main Authors: | Mirvakhabova, Leyla; Bejnordi, Babak Ehteshami; Kumar, Gaurav; Liang, Hanxue; Zhao, Wanru; Whatmough, Paul |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2510.01185 |
| _version_ | 1866914070637051904 |
|---|---|
| author | Mirvakhabova, Leyla; Bejnordi, Babak Ehteshami; Kumar, Gaurav; Liang, Hanxue; Zhao, Wanru; Whatmough, Paul |
| author_facet | Mirvakhabova, Leyla; Bejnordi, Babak Ehteshami; Kumar, Gaurav; Liang, Hanxue; Zhao, Wanru; Whatmough, Paul |
| contents | Upcycling pre-trained dense models into sparse Mixture-of-Experts (MoEs) efficiently increases model capacity but often suffers from poor expert specialization due to naive weight replication. Our analysis reveals that upcycled MoEs, even with conventional regularization, exhibit low-confidence, weakly differentiated routing, hindering performance. We introduce Dirichlet-Prior Shaping Loss (DPSL), a novel router regularization technique that directly shapes routing probability distributions by matching expert assignments to a target Dirichlet prior. DPSL offers fine-grained control over expert balance and specialization, and enables encoding of inductive biases such as encouraging experts to focus on specific modalities or tasks, without requiring manual intervention; notably, DPSL is a general tool applicable to any module that outputs categorical probability distributions, extending its utility beyond MoE training. Experiments on upcycled MoE vision-language models (with Qwen2, Phi3, Llama3.2 LLM backbones) show DPSL consistently outperforms upcycling strategies and regularization techniques across standard vision-language benchmarks, addressing the critical issue of poor specialization and fostering more adaptive, higher-performing models. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2510_01185 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2510.01185 |
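
The abstract describes DPSL as shaping router probabilities by matching them to a target Dirichlet prior, but this record does not give the loss formula. The block below is only a minimal sketch of one plausible reading, assuming the penalty is the negative log-density of each token's routing distribution under a Dirichlet with concentration vector `alpha`. The function names, the `eps` clamp, and the choice of negative log-density are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert router logits into a categorical distribution over experts."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dirichlet_nll(probs, alpha, eps=1e-8):
    """Negative log-density of `probs` under Dirichlet(alpha).

    Hypothetical penalty for illustration; the exact form of DPSL is
    not stated in the abstract.
    """
    # Log of the Dirichlet normalizer: log B(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    # Log of the unnormalized density: sum_i (alpha_i - 1) * log p_i
    log_kernel = sum(
        (a - 1.0) * math.log(max(p, eps)) for a, p in zip(alpha, probs)
    )
    return log_beta - log_kernel


def mean_prior_penalty(batch_logits, alpha):
    """Average the penalty over a batch of per-token router logits."""
    penalties = [dirichlet_nll(softmax(row), alpha) for row in batch_logits]
    return sum(penalties) / len(penalties)


# One confident and one undecided routing decision over four experts.
peaked = [[6.0, 0.0, 0.0, 0.0]]
uniform = [[0.0, 0.0, 0.0, 0.0]]

sparse_alpha = [0.3] * 4  # alpha < 1 puts prior mass near the simplex corners
dense_alpha = [5.0] * 4   # alpha > 1 puts prior mass near the uniform point

peaked_sparse = mean_prior_penalty(peaked, sparse_alpha)
uniform_sparse = mean_prior_penalty(uniform, sparse_alpha)
peaked_dense = mean_prior_penalty(peaked, dense_alpha)
uniform_dense = mean_prior_penalty(uniform, dense_alpha)
```

Under this sketch, a concentration below 1 assigns a lower penalty to confident routing, while a concentration above 1 favors evenly spread routing. An asymmetric `alpha` would be one way to express a preference for particular experts, which is how the abstract's "inductive biases" could be encoded under these assumptions.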