| Main Authors: | Wang, Siqi, Chen, Zhengyu, Li, Bei, He, Keqing, Zhang, Min, Wang, Jingang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.05661 |
Similar Items
Scaling and Transferability of Annealing Strategies in Large Language Model Training
by: Wang, Siqi, et al.
Published: (2025)
Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs
by: Chen, Zhengyu, et al.
Published: (2025)
Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models
by: Wang, Bo, et al.
Published: (2026)
Collaborative Compression for Large-Scale MoE Deployment on Edge
by: Chen, Yixiao, et al.
Published: (2025)
KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
by: Xu, Zukang, et al.
Published: (2026)
LocMoE: A Low-Overhead MoE for Large Language Model Training
by: Li, Jing, et al.
Published: (2024)
Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs
by: Zhang, Wuyue, et al.
Published: (2026)
EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
by: Chen, Yuanteng, et al.
Published: (2025)
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
by: Tang, Shengkun, et al.
Published: (2026)
Expert Divergence Learning for MoE-based Language Models
by: Li, Jiaang, et al.
Published: (2026)
Generalizing Scaling Laws for Dense and Sparse Large Language Models
by: Hossain, Md Arafat, et al.
Published: (2025)
MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)
FFT-MoE: Efficient Federated Fine-Tuning for Foundation Models via Large-scale Sparse MoE under Heterogeneous Edge
by: Hu, Gang, et al.
Published: (2025)
MoE$^2$: Optimizing Collaborative Inference for Edge Large Language Models
by: Jin, Lyudong, et al.
Published: (2025)
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
by: Shi, Xiaoming, et al.
Published: (2024)
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training
by: Du, Xianzhi, et al.
Published: (2024)
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
by: Dash, Sajal, et al.
Published: (2026)
Hierarchical LoRA MoE for Efficient CTR Model Scaling
by: Zeng, Zhichen, et al.
Published: (2025)
Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis
by: Pei, Zehua, et al.
Published: (2025)
Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
by: Ludziejewski, Jan, et al.
Published: (2025)
GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory
by: Wu, Haoze, et al.
Published: (2024)
Knowledge Editing on Black-box Large Language Models
by: Song, Xiaoshuai, et al.
Published: (2024)
Adaptive Normalization Mamba with Multi Scale Trend Decomposition and Patch MoE Encoding
by: Jeon, MinCheol
Published: (2025)
MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
by: Xie, Yanyue, et al.
Published: (2024)
LLaDA-MoE: A Sparse MoE Diffusion Language Model
by: Zhu, Fengqi, et al.
Published: (2025)
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
by: Li, Yunxin, et al.
Published: (2025)
Optimal Scaling Laws for Efficiency Gains in a Theoretical Transformer-Augmented Sectional MoE Framework
by: Sane, Soham
Published: (2025)
MEDNA-DFM: A Dual-View FiLM-MoE Model for Explainable DNA Methylation Prediction
by: He, Yi, et al.
Published: (2026)
Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts
by: Wang, Qi, et al.
Published: (2025)
MoE-PHDS: One MoE checkpoint for flexible runtime sparsity
by: Hannah, Lauren. A, et al.
Published: (2025)
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
by: Zhao, Chongyang, et al.
Published: (2026)
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
by: Zhang, Jiyuan, et al.
Published: (2026)
Accelerating MoE Model Inference with Expert Sharding
by: Balmau, Oana, et al.
Published: (2025)
Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
by: Yun, Sukwon, et al.
Published: (2024)
AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies
by: Zhang, Bo-Wen, et al.
Published: (2024)
DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs
by: Wang, Jing, et al.
Published: (2026)
GRIN: GRadient-INformed MoE
by: Liu, Liyuan, et al.
Published: (2024)
Spectral Manifold Regularization for Stable and Modular Routing in Deep MoE Architectures
by: Delibasoglu, Ibrahim
Published: (2026)
Towards Causal Relationship in Indefinite Data: Baseline Model and New Datasets
by: Chen, Hang, et al.
Published: (2024)
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
by: Li, Yan, et al.
Published: (2025)