Staff View: Echo: Simulating Distributed Training At Scale :: Library Catalog

Bibliographic Details
Main Authors:	Feng, Yicheng; Chen, Yuetao; Chen, Kaiwen; Li, Jingzong; Wu, Tianyuan; Cheng, Peng; Wu, Chuan; Wang, Wei; Ho, Tsung-Yi; Xu, Hong
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2412.12487

_version_	1866916527145484288
author	Feng, Yicheng; Chen, Yuetao; Chen, Kaiwen; Li, Jingzong; Wu, Tianyuan; Cheng, Peng; Wu, Chuan; Wang, Wei; Ho, Tsung-Yi; Xu, Hong
author_facet	Feng, Yicheng; Chen, Yuetao; Chen, Kaiwen; Li, Jingzong; Wu, Tianyuan; Cheng, Peng; Wu, Chuan; Wang, Wei; Ho, Tsung-Yi; Xu, Hong
contents	Simulation offers unique values for both enumeration and extrapolation purposes, and is becoming increasingly important for managing the massive machine learning (ML) clusters and large-scale distributed training jobs. In this paper, we build Echo to tackle three key challenges in large-scale training simulation: (1) tracing the runtime training workloads at each device in an ex-situ fashion so we can use a single device to obtain the actual execution graphs of 1K-GPU training, (2) accurately estimating the collective communication without high overheads of discrete-event based network simulation, and (3) accounting for the interference-induced computation slowdown from overlapping communication and computation kernels on the same device. Echo delivers on average 8% error in training step -- roughly 3x lower than state-of-the-art simulators -- for GPT-175B on a 96-GPU H800 cluster with 3D parallelism on Megatron-LM under 2 minutes.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_12487
institution	arXiv
publishDate	2024
record_format	arxiv
title	Echo: Simulating Distributed Training At Scale
topic	Machine Learning; Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2412.12487

Similar Items