Library Catalog

Bibliographic Details
Main Authors:	Liew, Seng Pei; Kato, Takuya; Takase, Sho
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2502.03009

Similar Items

Reusing Overtrained Language Models Saturates Scaling
by: Liew, Seng Pei, et al.
Published: (2025)

Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints
by: Liew, Seng Pei, et al.
Published: (2026)

Upcycling Large Language Models into Mixture of Experts
by: He, Ethan, et al.
Published: (2024)

Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
by: Nakamura, Taishi, et al.
Published: (2025)

MoIN: Mixture of Introvert Experts to Upcycle an LLM
by: Tejankar, Ajinkya, et al.
Published: (2024)

Scaling Laws for Fine-Grained Mixture of Experts
by: Krajewski, Jakub, et al.
Published: (2024)

Training-Free Dynamic Upcycling of Expert Language Models
by: Fanì, Eros, et al.
Published: (2026)

Large Vocabulary Size Improves Large Language Models
by: Takase, Sho, et al.
Published: (2024)

Towards a Comprehensive Scaling Law of Mixture-of-Experts
by: Zhao, Guoliang, et al.
Published: (2025)

Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective
by: Kajitsuka, Tokio, et al.
Published: (2026)

MoLEx: Mixture of Layer Experts for Finetuning with Sparse Upcycling
by: Teo, Rachel S. Y., et al.
Published: (2025)

Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging
by: Hui, Tingfeng, et al.
Published: (2024)

Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
by: Ludziejewski, Jan, et al.
Published: (2025)

Bayesian Mixture of Experts For Large Language Models
by: Dialameh, Maryam, et al.
Published: (2025)

CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models
by: Gu, Jiawei, et al.
Published: (2024)

A Survey on Mixture of Experts in Large Language Models
by: Cai, Weilin, et al.
Published: (2024)

HMoE: Heterogeneous Mixture of Experts for Language Modeling
by: Wang, An, et al.
Published: (2024)

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
by: Yoon, Youngsik, et al.
Published: (2026)

Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning
by: Yano, Kazuki, et al.
Published: (2026)

XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts
by: Ding, Yifeng, et al.
Published: (2024)

A Closer Look into Mixture-of-Experts in Large Language Models
by: Lo, Ka Man, et al.
Published: (2024)

Scaling Laws for Mixture Pretraining Under Data Constraints
by: Sedova, Anastasiia, et al.
Published: (2026)

Scaling Embeddings Outperforms Scaling Experts in Language Models
by: Liu, Hong, et al.
Published: (2026)

Mixture of Heterogeneous Grouped Experts for Language Modeling
by: Ma, Zhicheng, et al.
Published: (2026)

Parallel Scaling Law for Language Models
by: Chen, Mouxiang, et al.
Published: (2025)

Scaling Laws for Multilingual Language Models
by: He, Yifei, et al.
Published: (2024)

MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models
by: Liu, Zehua, et al.
Published: (2025)

MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models
by: Tastan, Nurbek, et al.
Published: (2026)

Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
by: Wang, Zihan, et al.
Published: (2025)

Pruning and Distilling Mixture-of-Experts into Dense Language Models
by: Kim, Junhyuck, et al.
Published: (2026)

OLMoE: Open Mixture-of-Experts Language Models
by: Muennighoff, Niklas, et al.
Published: (2024)

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
by: Lu, Xudong, et al.
Published: (2024)

Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models
by: Belenki, Lior, et al.
Published: (2025)

Mixture of Lookup Experts
by: Jie, Shibo, et al.
Published: (2025)

Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
by: Zhong, Zexuan, et al.
Published: (2024)

Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models
by: Kim, Gyeongman, et al.
Published: (2025)

MobileMoE: Scaling On-Device Mixture of Experts
by: Chen, Yanbei, et al.
Published: (2026)

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
by: Nakamura, Taishi, et al.
Published: (2025)

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models
by: Shing, Makoto, et al.
Published: (2025)

GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models
by: Zheng, Chen, et al.
Published: (2025)