| Main Authors: | Meng, Lin; Sun, Yuzhong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.16815 |
Similar Items
DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization
by: Tang, Zhenheng, et al.
Published: (2025)
BlockRaFT: A Distributed Framework for Fault-Tolerant and Scalable Blockchain Nodes
by: Piduguralla, Manaswini, et al.
Published: (2026)
Bandwidth-Aware and Cost-Efficient Pipeline Parallel Scheduling in Geo-Distributed LLM Training
by: Zhang, Han, et al.
Published: (2026)
Task Scheduling in Geo-Distributed Computing: A Survey
by: Wu, Yujian, et al.
Published: (2025)
Data-Locality-Aware Task Assignment and Scheduling for Distributed Job Executions
by: Zhao, Hailiang, et al.
Published: (2024)
QoE-oriented Dependent Task Scheduling under Multi-dimensional QoS Constraints over Distributed Networks
by: Fan, Xuwei, et al.
Published: (2023)
DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling
by: Pan, Yi, et al.
Published: (2026)
FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed Mixture-of-Experts Training
by: Gao, Yunqi, et al.
Published: (2025)
Scheduling Data-Intensive Workloads in Large-Scale Distributed Systems: Trends and Challenges
by: Stavrinides, Georgios L., et al.
Published: (2025)
ACE-Sync: An Adaptive Cloud-Edge Synchronization Framework for Communication-Efficient Large-Scale Distributed Model Training
by: Yang, Yi, et al.
Published: (2025)
Raptor: Distributed Scheduling for Serverless Functions
by: Exton, Kevin, et al.
Published: (2024)
MemFine: Memory-Aware Fine-Grained Scheduling for MoE Training
by: Zhao, Lu, et al.
Published: (2025)
iDDS: Intelligent Distributed Dispatch and Scheduling for Workflow Orchestration
by: Guan, Wen, et al.
Published: (2025)
Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training
by: Xu, Guanbin, et al.
Published: (2026)
CO2: Efficient Distributed Training with Full Communication-Computation Overlap
by: Sun, Weigao, et al.
Published: (2024)
FUSCO: High-Performance Distributed Data Shuffling via Transformation-Communication Fusion
by: Zhu, Zhuoran, et al.
Published: (2025)
Scheduling of Distributed Applications on the Computing Continuum: A Survey
by: Mehran, Narges, et al.
Published: (2024)
Trustworthy Scheduling for Big Data Applications
by: Tomaras, Dimitrios, et al.
Published: (2026)
Schedule-Level Shared-Prefix Reuse for LLM RL Training
by: Li, Pengbo, et al.
Published: (2026)
Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules
by: Pan, Xinglin, et al.
Published: (2024)
RapidGNN: Communication Efficient Large-Scale Distributed Training of Graph Neural Networks
by: Niam, Arefin, et al.
Published: (2025)
Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution
by: Wang, Haiquan, et al.
Published: (2024)
CondenseGraph: Communication-Efficient Distributed GNN Training via On-the-Fly Graph Condensation
by: Zhang, Zizhao, et al.
Published: (2026)
GreenDyGNN: Runtime-Adaptive Energy-Efficient Communication for Distributed GNN Training
by: Niam, Arefin, et al.
Published: (2026)
CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training
by: Chen, Tiancheng, et al.
Published: (2025)
Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads
by: Timilsina, Sankalpa, et al.
Published: (2025)
A Study on the Performance of Distributed Training of Data-driven CFD Simulations
by: Iserte, Sergio, et al.
Published: (2026)
PruneX: A Hierarchical Communication-Efficient System for Distributed CNN Training with Structured Pruning
by: Olama, Alireza, et al.
Published: (2025)
Eventually-Consistent Federated Scheduling for Data Center Workloads
by: Thiyyakat, Meghana, et al.
Published: (2023)
Retrofitting Service Dependency Discovery in Distributed Systems
by: Landau, Diogo, et al.
Published: (2025)
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
by: Duan, Jiangfei, et al.
Published: (2024)
FedFT: Improving Communication Performance for Federated Learning with Frequency Space Transformation
by: Palihawadana, Chamath, et al.
Published: (2024)
Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient
by: Deng, Xiaoge, et al.
Published: (2021)
DaggerFFT: A Distributed FFT Framework Using Task Scheduling in Julia
by: Anvari, Sana Taghipour, et al.
Published: (2026)
LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
by: Sun, Mo, et al.
Published: (2024)
Metronome: Efficient Scheduling for Periodic Traffic Jobs with Network and Priority Awareness
by: Jiang, Hao, et al.
Published: (2025)
A Flexible Programmable Pipeline Parallelism Framework for Efficient DNN Training
by: Jiang, Lijuan, et al.
Published: (2025)
Distributed Load Balancing with Workload-Dependent Service Rates
by: Zhang, Wenxin, et al.
Published: (2024)
A Reinforcement Learning-Driven Task Scheduling Algorithm for Multi-Tenant Distributed Systems
by: Zhang, Xiaopei, et al.
Published: (2025)
Optimizing Frequent Checkpointing via Low-Cost Differential for Distributed Training Systems
by: Yao, Chenxuan, et al.
Published: (2025)