:: Library Catalog

Bibliographic Details
Main Authors:	Yu, Hanfei; Cui, Xingqi; Zhang, Hong; Wang, Hao
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2502.05370

Similar Items

MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)

MemFine: Memory-Aware Fine-Grained Scheduling for MoE Training
by: Zhao, Lu, et al.
Published: (2025)

D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
by: Wang, Haodong, et al.
Published: (2025)

Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
by: Wu, Tian, et al.
Published: (2025)

Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism
by: Pan, Xinglin, et al.
Published: (2025)

DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance
by: Zhang, Yuning, et al.
Published: (2025)

Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens
by: Yu, Yanpeng, et al.
Published: (2025)

Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving
by: Liu, Ziming, et al.
Published: (2025)

OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
by: Wang, Liujianfu, et al.
Published: (2025)

HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
by: Tang, Peng, et al.
Published: (2024)

MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
by: Wang, Wenfeng, et al.
Published: (2025)

ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs
by: Sui, Yifan, et al.
Published: (2025)

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
by: Agrawal, Amey, et al.
Published: (2024)

ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
by: Li, Haley, et al.
Published: (2026)

Janus: Disaggregating Attention and Experts for Scalable MoE Inference
by: Zhang, Zhexiang, et al.
Published: (2025)

LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
by: Nie, Xiaonan, et al.
Published: (2024)

Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
by: Wang, Zhibin, et al.
Published: (2025)

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
by: Yuan, Yichao, et al.
Published: (2025)

Hexa-MoE: Efficient and Heterogeneous-aware Training for Mixture-of-Experts
by: Luo, Shuqing, et al.
Published: (2024)

ProMoE: Fast MoE-based LLM Serving using Proactive Caching
by: Song, Xiaoniu, et al.
Published: (2024)

MixServe: An Automatic Distributed Serving System for MoE Models with Hybrid Parallelism Based on Fused Communication Algorithm
by: Zhou, Bowen, et al.
Published: (2026)

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
by: Zheng, Size, et al.
Published: (2026)

Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading
by: Schieffer, Gabin, et al.
Published: (2026)

Memory Offloading for Large Language Model Inference with Latency SLO Guarantees
by: Ma, Chenxiang, et al.
Published: (2025)

BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs
by: Hu, Jianmin, et al.
Published: (2025)

Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving
by: Fan, Jiakun, et al.
Published: (2025)

eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
by: Tairin, Suraiya, et al.
Published: (2025)

Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
by: Meng, William, et al.
Published: (2025)

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)

Multi-Layer Scheduling for MoE-Based LLM Reasoning
by: Sun, Yifan, et al.
Published: (2026)

MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?
by: Ma, Songkai, et al.
Published: (2025)

LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training
by: Liu, Xinyi, et al.
Published: (2026)

Accelerating Distributed MoE Training and Inference with Lina
by: Li, Jiamin, et al.
Published: (2022)

EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
by: Qian, Yulei, et al.
Published: (2024)

Fine-grained MoE Load Balancing with Linear Programming
by: Zhao, Chenqi, et al.
Published: (2025)

ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
by: Shen, Zixu, et al.
Published: (2025)

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
by: Sun, Xun, et al.
Published: (2026)

OmniInfer: System-Wide Acceleration Techniques for Optimizing LLM Serving Throughput and Latency
by: Wang, Jun, et al.
Published: (2025)

DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs
by: Zhu, Zeyu, et al.
Published: (2026)

MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning
by: Liaw, Yong-Cheng, et al.
Published: (2025)