| Main Authors: | Zhang, Songyu, Tam, Aaron, Lee, Myungjin, Qi, Shixiong, Ramakrishnan, K. K. |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.01310 |
Similar Items
LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning
by: Qi, Shixiong, et al.
Published: (2024)
MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?
by: Ma, Songkai, et al.
Published: (2025)
MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
by: Xu, Tairan, et al.
Published: (2025)
MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
by: Cao, Shiyi, et al.
Published: (2024)
HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
by: Zhong, Shuzhang, et al.
Published: (2025)
DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference
by: Zhang, Yujie, et al.
Published: (2024)
eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
by: Tairin, Suraiya, et al.
Published: (2025)
Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference
by: Siavashi, Mohammad, et al.
Published: (2025)
HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
by: Tang, Peng, et al.
Published: (2024)
Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
by: Gupta, Vima, et al.
Published: (2024)
DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs
by: Zhu, Zeyu, et al.
Published: (2026)
Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
by: Luo, Shuqing, et al.
Published: (2025)
Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
by: Hu, Tianlun, et al.
Published: (2026)
MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
by: Liu, Dennis, et al.
Published: (2025)
Accelerating MoE Model Inference with Expert Sharding
by: Balmau, Oana, et al.
Published: (2025)
HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap
by: Lin, Wenxiang, et al.
Published: (2025)
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
by: Lee, Gunjun, et al.
Published: (2025)
SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
by: Chen, Liangkun, et al.
Published: (2025)
MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)
Palladium: A DPU-enabled Multi-Tenant Serverless Cloud over Zero-copy Multi-node RDMA Fabrics
by: Qi, Shixiong, et al.
Published: (2025)
SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference
by: Khare, Alind, et al.
Published: (2023)
ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving
by: Go, Seokjin, et al.
Published: (2026)
Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services
by: Liu, Jiachen, et al.
Published: (2024)
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
by: Li, Yan, et al.
Published: (2025)
On Harnessing Idle Compute at the Edge for Foundation Model Training
by: Xue, Leyang, et al.
Published: (2025)
MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
by: Jin, Chao, et al.
Published: (2025)
MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
by: Jiang, Yinsicheng, et al.
Published: (2025)
MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
by: Jiang, Yinsicheng, et al.
Published: (2024)
GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference
by: Han, Yu, et al.
Published: (2025)
MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
by: Wang, Wenfeng, et al.
Published: (2025)
Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters
by: Luo, Ziyue, et al.
Published: (2025)
DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction
by: Cai, Weilin, et al.
Published: (2025)
LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training
by: Liu, Xinyi, et al.
Published: (2026)
FlashMoE: Fast Distributed MoE in a Single Kernel
by: Aimuyo, Osayamen Jonathan, et al.
Published: (2025)
OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
by: Wang, Liujianfu, et al.
Published: (2025)
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
by: Zhang, Jiyuan, et al.
Published: (2026)
ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
by: Li, Haley, et al.
Published: (2026)
Floe: Federated Specialization for Real-Time LLM-SLM Inference
by: Tian, Chunlin, et al.
Published: (2026)
SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
by: Du, Zhixu, et al.
Published: (2023)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
by: Qian, Yulei, et al.
Published: (2024)