:: Library Catalog

Bibliographic Details
Main Authors:	Li, Shigang; Ben-Nun, Tal; Di Girolamo, Salvatore; Alistarh, Dan; Hoefler, Torsten
Format:	Preprint
Published:	2019
Subjects:	Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access:	https://arxiv.org/abs/1908.04207

Similar Items

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
by: Li, Shigang, et al.
Published: (2020)

Inductive Loop Analysis for Practical HPC Application Optimization
by: Schaad, Philipp, et al.
Published: (2025)

Near-Optimal Sparse Allreduce for Distributed Deep Learning
by: Li, Shigang, et al.
Published: (2022)

Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
by: Li, Shigang, et al.
Published: (2021)

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
by: Khalilov, Mikhail, et al.
Published: (2024)

SpaDA: A Spatial Dataflow Architecture Programming Language
by: Gianinazzi, Lukas, et al.
Published: (2025)

Lion Cub: Minimizing Communication Overhead in Distributed Lion
by: Ishikawa, Satoki, et al.
Published: (2024)

Low-Depth Spatial Tree Algorithms
by: Baumann, Yves, et al.
Published: (2024)

LLMQ: Efficient Lower-Precision Pretraining for Consumer GPUs
by: Schultheis, Erik, et al.
Published: (2025)

Zeppelin: Balancing Variable-length Workloads in Data Parallel Large Model Training
by: Chen, Chang, et al.
Published: (2025)

PICO: Performance Insights for Collective Operations
by: Pasqualoni, Saverio, et al.
Published: (2025)

Simple Opinion Dynamics for No-Regret Learning
by: Lazarsfeld, John, et al.
Published: (2023)

AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost
by: Chen, Jinfan, et al.
Published: (2023)

Maya: Optimizing Deep Learning Training Workloads using GPU Runtime Emulation
by: Yarlagadda, Srihas, et al.
Published: (2025)

SpComm3D: A Framework for Enabling Sparse Communication in 3D Sparse Kernels
by: Abubaker, Nabil, et al.
Published: (2024)

CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training
by: Chen, Tiancheng, et al.
Published: (2025)

Hybrid Decentralized Optimization: Leveraging Both First- and Zeroth-Order Optimizers for Faster Convergence
by: Ansaripour, Matin, et al.
Published: (2022)

Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search
by: Nichols, Daniel, et al.
Published: (2026)

A Unifying Framework to Enable Artificial Intelligence in High Performance Computing Workflows
by: Domke, Jens, et al.
Published: (2025)

Federated Learning with Workload Reduction through Partial Training of Client Models and Entropy-Based Data Selection
by: Shi, Hongrui, et al.
Published: (2024)

Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters
by: Luo, Ziyue, et al.
Published: (2025)

Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication
by: Gianinazzi, Lukas, et al.
Published: (2024)

Chameleon: Taming Dynamic Operator Sequences for Memory-Intensive LLM Training
by: Wang, Zibo, et al.
Published: (2025)

FaaSKeeper: Learning from Building Serverless Services with ZooKeeper as an Example
by: Copik, Marcin, et al.
Published: (2022)

xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads
by: Shi, Jiabo, et al.
Published: (2025)

Taming the Memory Beast: Strategies for Reliable ML Training on Kubernetes
by: Ray, Jaideep
Published: (2024)

Cppless: Single-Source and High-Performance Serverless Programming in C++
by: Copik, Marcin, et al.
Published: (2024)

Software Resource Disaggregation for HPC with Serverless Computing
by: Copik, Marcin, et al.
Published: (2024)

Asynch-SGBDT: Asynchronous Parallel Stochastic Gradient Boosting Decision Tree based on Parameters Server
by: Daning, Cheng, et al.
Published: (2018)

Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training
by: Adnan, Muhammad, et al.
Published: (2024)

Improving Automatic Parallel Training via Balanced Memory Workload Optimization
by: Wang, Yujie, et al.
Published: (2023)

Accelerating Compound LLM Training Workloads with Maestro
by: Yuan, Xiulong, et al.
Published: (2026)

Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads
by: Zhao, Wei, et al.
Published: (2024)

LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
by: Won, William, et al.
Published: (2021)

Embracing Federated Learning: Enabling Weak Client Participation via Partial Model Training
by: Lee, Sunwoo, et al.
Published: (2024)

HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning
by: Kim, Gyudong, et al.
Published: (2024)

Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip
by: Fusco, Luigi, et al.
Published: (2024)

Partial Federated Learning
by: Feng, Tiantian, et al.
Published: (2024)

BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers
by: Okanovic, Patrik, et al.
Published: (2025)

Efficient Unified Caching for Accelerating Heterogeneous AI Workloads
by: Wang, Tianze, et al.
Published: (2025)