| Main Authors: | Liu, Guowei; Li, Hongming; Guo, Yaning; Lyu, Yongxi; Zhou, Mo; Liu, Yi; Li, Zhaogeng; Wang, Yanpeng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.09721 |
Similar Items
Janus: Disaggregating Attention and Experts for Scalable MoE Inference
by: Zhang, Zhexiang, et al.
Published: (2025)
How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving
by: Wu, Hanjiang, et al.
Published: (2026)
LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
by: Nie, Xiaonan, et al.
Published: (2024)
ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
by: Li, Haley, et al.
Published: (2026)
OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
by: Wang, Liujianfu, et al.
Published: (2025)
Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism
by: Pan, Xinglin, et al.
Published: (2025)
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)
MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
by: Yuan, Yichao, et al.
Published: (2025)
SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
by: Chen, Liangkun, et al.
Published: (2025)
Accelerating Distributed MoE Training and Inference with Lina
by: Li, Jiamin, et al.
Published: (2022)
GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference
by: Han, Yu, et al.
Published: (2025)
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
by: Wang, Yingping, et al.
Published: (2026)
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
by: Zeng, Zhichen, et al.
Published: (2026)
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
by: Zhang, Jiyuan, et al.
Published: (2026)
Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems
by: Huang, En-Ming, et al.
Published: (2025)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
by: Qian, Yulei, et al.
Published: (2024)
ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving
by: Go, Seokjin, et al.
Published: (2026)
Hexa-MoE: Efficient and Heterogeneous-aware Training for Mixture-of-Experts
by: Luo, Shuqing, et al.
Published: (2024)
MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?
by: Ma, Songkai, et al.
Published: (2025)
MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
by: Liu, Dennis, et al.
Published: (2025)
D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
by: Wang, Haodong, et al.
Published: (2025)
Sparse Checkpointing for Fast and Reliable MoE Training
by: Gandhi, Swapnil, et al.
Published: (2024)
Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models
by: Wang, Wei, et al.
Published: (2024)
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
by: Qi, Shuyao, et al.
Published: (2026)
HarMoEny: Efficient Multi-GPU Inference of MoE Models
by: Doucet, Zachary, et al.
Published: (2025)
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
by: Zheng, Size, et al.
Published: (2026)
Fine-grained MoE Load Balancing with Linear Programming
by: Zhao, Chenqi, et al.
Published: (2025)
Multi-Layer Scheduling for MoE-Based LLM Reasoning
by: Sun, Yifan, et al.
Published: (2026)
Staleness-Centric Optimizations for Parallel Diffusion MoE Inference
by: Luo, Jiajun, et al.
Published: (2024)
MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
by: Cao, Shiyi, et al.
Published: (2024)
HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
by: Tang, Peng, et al.
Published: (2024)
When MoE Meets Blockchain: A Trustworthy Distributed Framework of Large Models
by: Zhu, Weihao, et al.
Published: (2025)
Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing
by: Liu, Wentao, et al.
Published: (2025)
INDIGO: Page Migration for Hardware Memory Disaggregation Across a Network
by: Patke, Archit, et al.
Published: (2025)
MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
by: Zhang, Zheng, et al.
Published: (2025)
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
by: Sun, Xun, et al.
Published: (2026)
PROBE: Co-Balancing Computation and Communication in MoE Inference via Real-Time Predictive Prefetching
by: Zhu, Qianchao, et al.
Published: (2026)
DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs
by: Zhu, Zeyu, et al.
Published: (2026)
Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving
by: Liu, Ziming, et al.
Published: (2025)
MemFine: Memory-Aware Fine-Grained Scheduling for MoE Training
by: Zhao, Lu, et al.
Published: (2025)