| Main Authors: | Wang, Yida, Hong, Ke, Li, Xiuhong, Xu, Yuanchao, Wang, Wenxun, Dai, Guohao, Wang, Yu |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.26541 |
Similar Items
semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage
by: Hong, Ke, et al.
Published: (2025)
Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering
by: Hong, Ke, et al.
Published: (2025)
DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training
by: Wang, Yuanqing, et al.
Published: (2026)
DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism
by: Jiang, Chenyu, et al.
Published: (2025)
FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
by: Wang, Yujie, et al.
Published: (2024)
Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization
by: Li, Jinhao, et al.
Published: (2023)
DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers
by: Zhao, Xuanlei, et al.
Published: (2024)
HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
by: Zhang, Geng, et al.
Published: (2025)
Beyond the Federation: Topology-aware Federated Learning for Generalization to Unseen Clients
by: Ma, Mengmeng, et al.
Published: (2024)
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
by: Arfeen, Daiyaan, et al.
Published: (2024)
Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping
by: Jiang, Chenyu, et al.
Published: (2024)
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
by: Gupta, Ahan, et al.
Published: (2026)
Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
by: Luo, Shuqing, et al.
Published: (2025)
CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving
by: Zhao, Adrian, et al.
Published: (2026)
PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
by: Bai, Xu, et al.
Published: (2026)
Topology-aware Federated Learning in Edge Computing: A Comprehensive Survey
by: Wu, Jiajun, et al.
Published: (2023)
LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
by: Wu, Bingyang, et al.
Published: (2024)
Memory and Bandwidth are All You Need for Fully Sharded Data Parallel
by: Wang, Jiangtao, et al.
Published: (2025)
ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models
by: Wang, Yujia, et al.
Published: (2025)
Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
by: Wang, Chong, et al.
Published: (2026)
Stabilizing Decentralized Federated Fine-Tuning via Topology-Aware Alternating LoRA
by: Wang, Xiaoyu, et al.
Published: (2026)
Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
by: Li, Haoyang, et al.
Published: (2024)
Heterogeneous Parallelism for Multimodal Large Language Model Training
by: Karnati, Yashaswi, et al.
Published: (2026)
Locality-aware Fair Scheduling in LLM Serving
by: Cao, Shiyi, et al.
Published: (2025)
Towards cost-effective and resource-aware aggregation at Edge for Federated Learning
by: Khan, Ahmad Faraz, et al.
Published: (2022)
DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems
by: Dai, Yuanjun, et al.
Published: (2025)
LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
by: Won, William, et al.
Published: (2021)
Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training
by: Zhang, Xin, et al.
Published: (2025)
On Optimizing the Communication of Model Parallelism
by: Zhuang, Yonghao, et al.
Published: (2022)
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
by: Zhu, Ruidong, et al.
Published: (2025)
Aryl: An Elastic Cluster Scheduler for Deep Learning
by: Li, Jiamin, et al.
Published: (2022)
Unleashing the Power of Continual Learning on Non-Centralized Devices: A Survey
by: Li, Yichen, et al.
Published: (2024)
Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
by: Huang, Mincong, et al.
Published: (2024)
Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
by: Rajbhandari, Samyam, et al.
Published: (2025)
Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design
by: Xue, Chunyu, et al.
Published: (2024)
AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training
by: Chen, Ling, et al.
Published: (2026)
Locally Estimated Global Perturbations are Better than Local Perturbations for Federated Sharpness-aware Minimization
by: Fan, Ziqing, et al.
Published: (2024)
Enhancing Physics-Informed Neural Networks with a Hybrid Parallel Kolmogorov-Arnold and MLP Architecture
by: Xu, Zuyu, et al.
Published: (2025)
DHP: Efficient Scaling of MLLM Training with Dynamic Hybrid Parallelism
by: Niu, Yifan, et al.
Published: (2026)
MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
by: Liu, Dennis, et al.
Published: (2025)