| Main Authors: | Liew, Seng Pei, Kato, Takuya, Takase, Sho |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2502.03009 |
Similar Items
Reusing Overtrained Language Models Saturates Scaling
by: Liew, Seng Pei, et al.
Published: (2025)
Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints
by: Liew, Seng Pei, et al.
Published: (2026)
Upcycling Large Language Models into Mixture of Experts
by: He, Ethan, et al.
Published: (2024)
Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
by: Nakamura, Taishi, et al.
Published: (2025)
MoIN: Mixture of Introvert Experts to Upcycle an LLM
by: Tejankar, Ajinkya, et al.
Published: (2024)
Scaling Laws for Fine-Grained Mixture of Experts
by: Krajewski, Jakub, et al.
Published: (2024)
Training-Free Dynamic Upcycling of Expert Language Models
by: Fanì, Eros, et al.
Published: (2026)
Large Vocabulary Size Improves Large Language Models
by: Takase, Sho, et al.
Published: (2024)
Towards a Comprehensive Scaling Law of Mixture-of-Experts
by: Zhao, Guoliang, et al.
Published: (2025)
Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective
by: Kajitsuka, Tokio, et al.
Published: (2026)
MoLEx: Mixture of Layer Experts for Finetuning with Sparse Upcycling
by: Teo, Rachel S. Y., et al.
Published: (2025)
Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging
by: Hui, Tingfeng, et al.
Published: (2024)
Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
by: Ludziejewski, Jan, et al.
Published: (2025)
Bayesian Mixture of Experts For Large Language Models
by: Dialameh, Maryam, et al.
Published: (2025)
CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models
by: Gu, Jiawei, et al.
Published: (2024)
A Survey on Mixture of Experts in Large Language Models
by: Cai, Weilin, et al.
Published: (2024)
HMoE: Heterogeneous Mixture of Experts for Language Modeling
by: Wang, An, et al.
Published: (2024)
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
by: Yoon, Youngsik, et al.
Published: (2026)
Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning
by: Yano, Kazuki, et al.
Published: (2026)
XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts
by: Ding, Yifeng, et al.
Published: (2024)
A Closer Look into Mixture-of-Experts in Large Language Models
by: Lo, Ka Man, et al.
Published: (2024)
Scaling Laws for Mixture Pretraining Under Data Constraints
by: Sedova, Anastasiia, et al.
Published: (2026)
Scaling Embeddings Outperforms Scaling Experts in Language Models
by: Liu, Hong, et al.
Published: (2026)
Mixture of Heterogeneous Grouped Experts for Language Modeling
by: Ma, Zhicheng, et al.
Published: (2026)
Parallel Scaling Law for Language Models
by: Chen, Mouxiang, et al.
Published: (2025)
Scaling Laws for Multilingual Language Models
by: He, Yifei, et al.
Published: (2024)
MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models
by: Liu, Zehua, et al.
Published: (2025)
MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models
by: Tastan, Nurbek, et al.
Published: (2026)
Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
by: Wang, Zihan, et al.
Published: (2025)
Pruning and Distilling Mixture-of-Experts into Dense Language Models
by: Kim, Junhyuck, et al.
Published: (2026)
OLMoE: Open Mixture-of-Experts Language Models
by: Muennighoff, Niklas, et al.
Published: (2024)
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
by: Lu, Xudong, et al.
Published: (2024)
Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models
by: Belenki, Lior, et al.
Published: (2025)
Mixture of Lookup Experts
by: Jie, Shibo, et al.
Published: (2025)
Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
by: Zhong, Zexuan, et al.
Published: (2024)
Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models
by: Kim, Gyeongman, et al.
Published: (2025)
MobileMoE: Scaling On-Device Mixture of Experts
by: Chen, Yanbei, et al.
Published: (2026)
Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
by: Nakamura, Taishi, et al.
Published: (2025)
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models
by: Shing, Makoto, et al.
Published: (2025)
GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models
by: Zheng, Chen, et al.
Published: (2025)