Staff View: Rina: Enhancing Ring-AllReduce with In-network Aggregation in Distributed Model Training :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Zixuan; Liu, Xuandong; Li, Minglin; Hu, Yinfan; Mei, Hao; Xing, Huifeng; Wang, Hao; Shi, Wanxin; Liu, Sen; Xu, Yang
Format:	Preprint
Published:	2024
Subjects:	Networking and Internet Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2407.19721

_version_	1866909272341741568
author	Chen, Zixuan; Liu, Xuandong; Li, Minglin; Hu, Yinfan; Mei, Hao; Xing, Huifeng; Wang, Hao; Shi, Wanxin; Liu, Sen; Xu, Yang
author_facet	Chen, Zixuan; Liu, Xuandong; Li, Minglin; Hu, Yinfan; Mei, Hao; Xing, Huifeng; Wang, Hao; Shi, Wanxin; Liu, Sen; Xu, Yang
contents	Parameter Server (PS) and Ring-AllReduce (RAR) are two widely used synchronization architectures in multi-worker Deep Learning (DL), also referred to as Distributed Deep Learning (DDL). However, PS suffers from the "incast" issue, while RAR struggles with problems caused by its long dependency chain. The emerging In-network Aggregation (INA) has been proposed for integration with PS to mitigate its incast issue. However, such PS-based INA is poorly suited to incremental deployment, as it requires replacing all the switches to show significant performance improvement, which is not cost-effective. In this study, we present the incorporation of INA capabilities into RAR, called RAR with In-Network Aggregation (Rina), to tackle both of the problems above. Rina features an agent-worker mechanism. When an INA-capable ToR switch is deployed, all workers in its rack run as one abstracted worker with the help of the agent, resulting in both excellent incremental deployment capabilities and better throughput. We conducted extensive testbed and simulation evaluations to substantiate the throughput advantages of Rina over existing DDL training synchronization structures. Compared with the state-of-the-art PS-based INA method, ATP, Rina can achieve more than 50% throughput with the same hardware cost.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_19721
institution	arXiv
publishDate	2024
record_format	arxiv
title	Rina: Enhancing Ring-AllReduce with In-network Aggregation in Distributed Model Training
topic	Networking and Internet Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2407.19721

Similar Items