| Main Authors: | Li, Shigang, Ben-Nun, Tal, Di Girolamo, Salvatore, Alistarh, Dan, Hoefler, Torsten |
|---|---|
| Format: | Preprint |
| Published: | 2019 |
| Online Access: | https://arxiv.org/abs/1908.04207 |
Similar Items
Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
by: Li, Shigang, et al.
Published: (2020)
Inductive Loop Analysis for Practical HPC Application Optimization
by: Schaad, Philipp, et al.
Published: (2025)
Near-Optimal Sparse Allreduce for Distributed Deep Learning
by: Li, Shigang, et al.
Published: (2022)
Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
by: Li, Shigang, et al.
Published: (2021)
Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
by: Khalilov, Mikhail, et al.
Published: (2024)
SpaDA: A Spatial Dataflow Architecture Programming Language
by: Gianinazzi, Lukas, et al.
Published: (2025)
Lion Cub: Minimizing Communication Overhead in Distributed Lion
by: Ishikawa, Satoki, et al.
Published: (2024)
Low-Depth Spatial Tree Algorithms
by: Baumann, Yves, et al.
Published: (2024)
LLMQ: Efficient Lower-Precision Pretraining for Consumer GPUs
by: Schultheis, Erik, et al.
Published: (2025)
Zeppelin: Balancing Variable-length Workloads in Data Parallel Large Model Training
by: Chen, Chang, et al.
Published: (2025)
PICO: Performance Insights for Collective Operations
by: Pasqualoni, Saverio, et al.
Published: (2025)
Simple Opinion Dynamics for No-Regret Learning
by: Lazarsfeld, John, et al.
Published: (2023)
AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost
by: Chen, Jinfan, et al.
Published: (2023)
Maya: Optimizing Deep Learning Training Workloads using GPU Runtime Emulation
by: Yarlagadda, Srihas, et al.
Published: (2025)
SpComm3D: A Framework for Enabling Sparse Communication in 3D Sparse Kernels
by: Abubaker, Nabil, et al.
Published: (2024)
CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training
by: Chen, Tiancheng, et al.
Published: (2025)
Hybrid Decentralized Optimization: Leveraging Both First- and Zeroth-Order Optimizers for Faster Convergence
by: Ansaripour, Matin, et al.
Published: (2022)
Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search
by: Nichols, Daniel, et al.
Published: (2026)
A Unifying Framework to Enable Artificial Intelligence in High Performance Computing Workflows
by: Domke, Jens, et al.
Published: (2025)
Federated Learning with Workload Reduction through Partial Training of Client Models and Entropy-Based Data Selection
by: Shi, Hongrui, et al.
Published: (2024)
Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters
by: Luo, Ziyue, et al.
Published: (2025)
Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication
by: Gianinazzi, Lukas, et al.
Published: (2024)
Chameleon: Taming Dynamic Operator Sequences for Memory-Intensive LLM Training
by: Wang, Zibo, et al.
Published: (2025)
FaaSKeeper: Learning from Building Serverless Services with ZooKeeper as an Example
by: Copik, Marcin, et al.
Published: (2022)
xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads
by: Shi, Jiabo, et al.
Published: (2025)
Taming the Memory Beast: Strategies for Reliable ML Training on Kubernetes
by: Ray, Jaideep
Published: (2024)
Cppless: Single-Source and High-Performance Serverless Programming in C++
by: Copik, Marcin, et al.
Published: (2024)
Software Resource Disaggregation for HPC with Serverless Computing
by: Copik, Marcin, et al.
Published: (2024)
Asynch-SGBDT: Asynchronous Parallel Stochastic Gradient Boosting Decision Tree based on Parameters Server
by: Daning, Cheng, et al.
Published: (2018)
Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training
by: Adnan, Muhammad, et al.
Published: (2024)
Improving Automatic Parallel Training via Balanced Memory Workload Optimization
by: Wang, Yujie, et al.
Published: (2023)
Accelerating Compound LLM Training Workloads with Maestro
by: Yuan, Xiulong, et al.
Published: (2026)
Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads
by: Zhao, Wei, et al.
Published: (2024)
LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
by: Won, William, et al.
Published: (2021)
Embracing Federated Learning: Enabling Weak Client Participation via Partial Model Training
by: Lee, Sunwoo, et al.
Published: (2024)
HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning
by: Kim, Gyudong, et al.
Published: (2024)
Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip
by: Fusco, Luigi, et al.
Published: (2024)
Partial Federated Learning
by: Feng, Tiantian, et al.
Published: (2024)
BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers
by: Okanovic, Patrik, et al.
Published: (2025)
Efficient Unified Caching for Accelerating Heterogeneous AI Workloads
by: Wang, Tianze, et al.
Published: (2025)