Staff View: ABS: Adaptive Bounded Staleness Converges Faster and Communicates Less :: Library Catalog

Bibliographic Details
Main Authors:	Tan, Qiao; Zhu, Feng; Zhang, Jingjing
Format:	Preprint
Published:	2023
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2301.08895

_version_	1866929215013650432
author	Tan, Qiao; Zhu, Feng; Zhang, Jingjing
author_facet	Tan, Qiao; Zhu, Feng; Zhang, Jingjing
contents	Wall-clock convergence time and communication rounds are critical performance metrics in distributed learning with a parameter-server (PS) setting. Synchronous methods converge fast but are not robust to stragglers, whereas asynchronous methods reduce the wall-clock time per round but suffer from a degraded convergence rate due to the staleness of gradients; it is therefore natural to combine the two to achieve a balance. In this work, we develop a novel asynchronous strategy, named adaptive bounded staleness (ABS), that leverages the advantages of both synchronous and asynchronous methods. The key enablers of ABS are twofold. First, the number of workers that the PS waits for per round for gradient aggregation is adaptively selected to strike a straggling-staleness balance. Second, workers with relatively high staleness are required to start a new round of computation to alleviate the negative effect of staleness. Simulation results are provided to demonstrate the superiority of ABS over state-of-the-art schemes in terms of wall-clock time and communication rounds.
format	Preprint
id	arxiv_https___arxiv_org_abs_2301_08895
institution	arXiv
publishDate	2023
record_format	arxiv
title	ABS: Adaptive Bounded Staleness Converges Faster and Communicates Less
topic	Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2301.08895

Similar Items