Saved in:
| Main Authors: | Yu, Hanfei, Cui, Xingqi, Zhang, Hong, Wang, Hao |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.05370 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)
by: Yu, Hanfei, et al.
Published: (2026)
MemFine: Memory-Aware Fine-Grained Scheduling for MoE Training
by: Zhao, Lu, et al.
Published: (2025)
by: Zhao, Lu, et al.
Published: (2025)
D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
by: Wang, Haodong, et al.
Published: (2025)
by: Wang, Haodong, et al.
Published: (2025)
Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
by: Wu, Tian, et al.
Published: (2025)
by: Wu, Tian, et al.
Published: (2025)
Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism
by: Pan, Xinglin, et al.
Published: (2025)
by: Pan, Xinglin, et al.
Published: (2025)
DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance
by: Zhang, Yuning, et al.
Published: (2025)
by: Zhang, Yuning, et al.
Published: (2025)
Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens
by: Yu, Yanpeng, et al.
Published: (2025)
by: Yu, Yanpeng, et al.
Published: (2025)
Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving
by: Liu, Ziming, et al.
Published: (2025)
by: Liu, Ziming, et al.
Published: (2025)
OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
by: Wang, Liujianfu, et al.
Published: (2025)
by: Wang, Liujianfu, et al.
Published: (2025)
HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference
by: Tang, Peng, et al.
Published: (2024)
by: Tang, Peng, et al.
Published: (2024)
MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
by: Wang, Wenfeng, et al.
Published: (2025)
by: Wang, Wenfeng, et al.
Published: (2025)
ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs
by: Sui, Yifan, et al.
Published: (2025)
by: Sui, Yifan, et al.
Published: (2025)
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
by: Agrawal, Amey, et al.
Published: (2024)
by: Agrawal, Amey, et al.
Published: (2024)
ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
by: Li, Haley, et al.
Published: (2026)
by: Li, Haley, et al.
Published: (2026)
Janus: Disaggregating Attention and Experts for Scalable MoE Inference
by: Zhang, Zhexiang, et al.
Published: (2025)
by: Zhang, Zhexiang, et al.
Published: (2025)
LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing
by: Nie, Xiaonan, et al.
Published: (2024)
by: Nie, Xiaonan, et al.
Published: (2024)
Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
by: Wang, Zhibin, et al.
Published: (2025)
by: Wang, Zhibin, et al.
Published: (2025)
MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
by: Yuan, Yichao, et al.
Published: (2025)
by: Yuan, Yichao, et al.
Published: (2025)
Hexa-MoE: Efficient and Heterogeneous-aware Training for Mixture-of-Experts
by: Luo, Shuqing, et al.
Published: (2024)
by: Luo, Shuqing, et al.
Published: (2024)
ProMoE: Fast MoE-based LLM Serving using Proactive Caching
by: Song, Xiaoniu, et al.
Published: (2024)
by: Song, Xiaoniu, et al.
Published: (2024)
MixServe: An Automatic Distributed Serving System for MoE Models with Hybrid Parallelism Based on Fused Communication Algorithm
by: Zhou, Bowen, et al.
Published: (2026)
by: Zhou, Bowen, et al.
Published: (2026)
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
by: Zheng, Size, et al.
Published: (2026)
by: Zheng, Size, et al.
Published: (2026)
Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading
by: Schieffer, Gabin, et al.
Published: (2026)
by: Schieffer, Gabin, et al.
Published: (2026)
Memory Offloading for Large Language Model Inference with Latency SLO Guarantees
by: Ma, Chenxiang, et al.
Published: (2025)
by: Ma, Chenxiang, et al.
Published: (2025)
BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs
by: Hu, Jianmin, et al.
Published: (2025)
by: Hu, Jianmin, et al.
Published: (2025)
Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving
by: Fan, Jiakun, et al.
Published: (2025)
by: Fan, Jiakun, et al.
Published: (2025)
eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
by: Tairin, Suraiya, et al.
Published: (2025)
by: Tairin, Suraiya, et al.
Published: (2025)
Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
by: Meng, William, et al.
Published: (2025)
by: Meng, William, et al.
Published: (2025)
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)
by: Zhou, Zhuoshan, et al.
Published: (2026)
Multi-Layer Scheduling for MoE-Based LLM Reasoning
by: Sun, Yifan, et al.
Published: (2026)
by: Sun, Yifan, et al.
Published: (2026)
MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?
by: Ma, Songkai, et al.
Published: (2025)
by: Ma, Songkai, et al.
Published: (2025)
LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training
by: Liu, Xinyi, et al.
Published: (2026)
by: Liu, Xinyi, et al.
Published: (2026)
Accelerating Distributed MoE Training and Inference with Lina
by: Li, Jiamin, et al.
Published: (2022)
by: Li, Jiamin, et al.
Published: (2022)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
by: Qian, Yulei, et al.
Published: (2024)
by: Qian, Yulei, et al.
Published: (2024)
Fine-grained MoE Load Balancing with Linear Programming
by: Zhao, Chenqi, et al.
Published: (2025)
by: Zhao, Chenqi, et al.
Published: (2025)
ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
by: Shen, Zixu, et al.
Published: (2025)
by: Shen, Zixu, et al.
Published: (2025)
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
by: Sun, Xun, et al.
Published: (2026)
by: Sun, Xun, et al.
Published: (2026)
OmniInfer: System-Wide Acceleration Techniques for Optimizing LLM Serving Throughput and Latency
by: Wang, Jun, et al.
Published: (2025)
by: Wang, Jun, et al.
Published: (2025)
DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs
by: Zhu, Zeyu, et al.
Published: (2026)
by: Zhu, Zeyu, et al.
Published: (2026)
MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning
by: Liaw, Yong-Cheng, et al.
Published: (2025)
by: Liaw, Yong-Cheng, et al.
Published: (2025)
Similar Items
-
MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026) -
MemFine: Memory-Aware Fine-Grained Scheduling for MoE Training
by: Zhao, Lu, et al.
Published: (2025) -
D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
by: Wang, Haodong, et al.
Published: (2025) -
Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
by: Wu, Tian, et al.
Published: (2025) -
Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism
by: Pan, Xinglin, et al.
Published: (2025)