:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Yida; Hong, Ke; Li, Xiuhong; Xu, Yuanchao; Wang, Wenxun; Dai, Guohao; Wang, Yu
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2509.26541

Similar Items

semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage
by: Hong, Ke, et al.
Published: (2025)

Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering
by: Hong, Ke, et al.
Published: (2025)

DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training
by: Wang, Yuanqing, et al.
Published: (2026)

DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism
by: Jiang, Chenyu, et al.
Published: (2025)

FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
by: Wang, Yujie, et al.
Published: (2024)

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization
by: Li, Jinhao, et al.
Published: (2023)

DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers
by: Zhao, Xuanlei, et al.
Published: (2024)

HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
by: Zhang, Geng, et al.
Published: (2025)

Beyond the Federation: Topology-aware Federated Learning for Generalization to Unseen Clients
by: Ma, Mengmeng, et al.
Published: (2024)

PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
by: Arfeen, Daiyaan, et al.
Published: (2024)

Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
by: Jiang, Chenyu, et al.
Published: (2024)

AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
by: Gupta, Ahan, et al.
Published: (2026)

Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
by: Luo, Shuqing, et al.
Published: (2025)

CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving
by: Zhao, Adrian, et al.
Published: (2026)

PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
by: Bai, Xu, et al.
Published: (2026)

Topology-aware Federated Learning in Edge Computing: A Comprehensive Survey
by: Wu, Jiajun, et al.
Published: (2023)

LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
by: Wu, Bingyang, et al.
Published: (2024)

Memory and Bandwidth are All You Need for Fully Sharded Data Parallel
by: Wang, Jiangtao, et al.
Published: (2025)

ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models
by: Wang, Yujia, et al.
Published: (2025)

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
by: Wang, Chong, et al.
Published: (2026)

Stabilizing Decentralized Federated Fine-Tuning via Topology-Aware Alternating LoRA
by: Wang, Xiaoyu, et al.
Published: (2026)

Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
by: Li, Haoyang, et al.
Published: (2024)

Heterogeneous Parallelism for Multimodal Large Language Model Training
by: Karnati, Yashaswi, et al.
Published: (2026)

Locality-aware Fair Scheduling in LLM Serving
by: Cao, Shiyi, et al.
Published: (2025)

Towards cost-effective and resource-aware aggregation at Edge for Federated Learning
by: Khan, Ahmad Faraz, et al.
Published: (2022)

DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems
by: Dai, Yuanjun, et al.
Published: (2025)

LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
by: Won, William, et al.
Published: (2021)

Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training
by: Zhang, Xin, et al.
Published: (2025)

On Optimizing the Communication of Model Parallelism
by: Zhuang, Yonghao, et al.
Published: (2022)

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
by: Zhu, Ruidong, et al.
Published: (2025)

Aryl: An Elastic Cluster Scheduler for Deep Learning
by: Li, Jiamin, et al.
Published: (2022)

Unleashing the Power of Continual Learning on Non-Centralized Devices: A Survey
by: Li, Yichen, et al.
Published: (2024)

Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
by: Huang, Mincong, et al.
Published: (2024)

Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
by: Rajbhandari, Samyam, et al.
Published: (2025)

Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design
by: Xue, Chunyu, et al.
Published: (2024)

AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training
by: Chen, Ling, et al.
Published: (2026)

Locally Estimated Global Perturbations are Better than Local Perturbations for Federated Sharpness-aware Minimization
by: Fan, Ziqing, et al.
Published: (2024)

Enhancing Physics-Informed Neural Networks with a Hybrid Parallel Kolmogorov-Arnold and MLP Architecture
by: Xu, Zuyu, et al.
Published: (2025)

DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism
by: Niu, Yifan, et al.
Published: (2026)

MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
by: Liu, Dennis, et al.
Published: (2025)