:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Zhang, Songyu, Tam, Aaron, Lee, Myungjin, Qi, Shixiong, Ramakrishnan, K. K.
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing Machine Learning
Online Access:	https://arxiv.org/abs/2601.01310
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning
by: Qi, Shixiong, et al.
Published: (2024)

MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?
by: Ma, Songkai, et al.
Published: (2025)

MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
by: Xu, Tairan, et al.
Published: (2025)

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
by: Cao, Shiyi, et al.
Published: (2024)

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
by: Zhong, Shuzhang, et al.
Published: (2025)

DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference
by: Zhang, Yujie, et al.
Published: (2024)

eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
by: Tairin, Suraiya, et al.
Published: (2025)

Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
by: Siavashi, Mohammad, et al.
Published: (2025)

HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
by: Tang, Peng, et al.
Published: (2024)

Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
by: Gupta, Vima, et al.
Published: (2024)

DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs
by: Zhu, Zeyu, et al.
Published: (2026)

Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
by: Luo, Shuqing, et al.
Published: (2025)

Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
by: Hu, Tianlun, et al.
Published: (2026)

MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
by: Liu, Dennis, et al.
Published: (2025)

Accelerating MoE Model Inference with Expert Sharding
by: Balmau, Oana, et al.
Published: (2025)

HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
by: Lin, Wenxiang, et al.
Published: (2025)

From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
by: Lee, Gunjun, et al.
Published: (2025)

SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
by: Chen, Liangkun, et al.
Published: (2025)

MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)

Palladium: A DPU-enabled Multi-Tenant Serverless Cloud over Zero-copy Multi-node RDMA Fabrics
by: Qi, Shixiong, et al.
Published: (2025)

SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference
by: Khare, Alind, et al.
Published: (2023)

ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving
by: Go, Seokjin, et al.
Published: (2026)

Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services
by: Liu, Jiachen, et al.
Published: (2024)

Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
by: Li, Yan, et al.
Published: (2025)

On Harnessing Idle Compute at the Edge for Foundation Model Training
by: Xue, Leyang, et al.
Published: (2025)

MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
by: Jin, Chao, et al.
Published: (2025)

MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
by: Jiang, Yinsicheng, et al.
Published: (2025)

MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
by: Jiang, Yinsicheng, et al.
Published: (2024)

GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference
by: Han, Yu, et al.
Published: (2025)

MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
by: Wang, Wenfeng, et al.
Published: (2025)

Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters
by: Luo, Ziyue, et al.
Published: (2025)

DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction
by: Cai, Weilin, et al.
Published: (2025)

LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training
by: Liu, Xinyi, et al.
Published: (2026)

FlashMoE: Fast Distributed MoE in a Single Kernel
by: Aimuyo, Osayamen Jonathan, et al.
Published: (2025)

OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
by: Wang, Liujianfu, et al.
Published: (2025)

MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
by: Zhang, Jiyuan, et al.
Published: (2026)

ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
by: Li, Haley, et al.
Published: (2026)

Floe: Federated Specialization for Real-Time LLM-SLM Inference
by: Tian, Chunlin, et al.
Published: (2026)

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
by: Du, Zhixu, et al.
Published: (2023)

EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
by: Qian, Yulei, et al.
Published: (2024)