| Main Authors: | Tan, Qiao; Zhu, Feng; Zhang, Jingjing |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | Distributed, Parallel, and Cluster Computing |
| Online Access: | https://arxiv.org/abs/2301.08895 |
| _version_ | 1866929215013650432 |
|---|---|
| author | Tan, Qiao; Zhu, Feng; Zhang, Jingjing |
| author_facet | Tan, Qiao; Zhu, Feng; Zhang, Jingjing |
| contents | Wall-clock convergence time and communication rounds are critical performance metrics in distributed learning in the parameter-server (PS) setting. Synchronous methods converge fast but are not robust to stragglers, whereas asynchronous methods reduce the wall-clock time per round but suffer from a degraded convergence rate due to gradient staleness; it is therefore natural to combine the two to strike a balance. In this work, we develop a novel asynchronous strategy, named adaptive bounded staleness (ABS), that leverages the advantages of both synchronous and asynchronous methods. The key enablers of ABS are twofold. First, the number of workers that the PS waits for per round for gradient aggregation is adaptively selected to strike a straggling-staleness balance. Second, workers with relatively high staleness are required to start a new round of computation to alleviate the negative effect of staleness. Simulation results are provided to demonstrate the superiority of ABS over state-of-the-art schemes in terms of wall-clock time and communication rounds. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2301_08895 |
| institution | arXiv |
| publishDate | 2023 |
| record_format | arxiv |
| title | ABS: Adaptive Bounded Staleness Converges Faster and Communicates Less |
| topic | Distributed, Parallel, and Cluster Computing |
| url | https://arxiv.org/abs/2301.08895 |
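
The two mechanisms named in the abstract (waiting for only a subset of workers per round, and restarting workers whose gradients have become too stale) can be illustrated with a small timing simulation. The following is a minimal sketch under stated assumptions, not the authors' algorithm: the function name `simulate_bounded_staleness`, its parameters, and the exponential compute-time model are all hypothetical. In particular, `wait_for` is held fixed here, whereas the abstract says ABS selects it adaptively; the selection rule is not given in this record and is not reproduced.

```python
import heapq
import random


def simulate_bounded_staleness(num_workers, num_rounds, wait_for,
                               max_staleness, seed=0):
    """Toy timing model of a parameter server (PS) with bounded staleness.

    Assumptions (not from the source): compute times are i.i.d.
    exponential, and wait_for is a fixed constant rather than adaptive.
    Returns (total wall-clock time, list of staleness values of the
    gradients the PS aggregated).
    """
    if not 1 <= wait_for <= num_workers:
        raise ValueError("wait_for must be between 1 and num_workers")
    rng = random.Random(seed)
    now = 0.0
    # Each entry: (finish_time, worker_id, round in which the model was read).
    pending = [(rng.expovariate(1.0), w, 0) for w in range(num_workers)]
    heapq.heapify(pending)
    staleness_log = []

    for rnd in range(num_rounds):
        # The PS waits only for the wait_for earliest gradients.
        arrived = [heapq.heappop(pending) for _ in range(wait_for)]
        now = max(now, arrived[-1][0])
        staleness_log.extend(rnd - start for _, _, start in arrived)

        next_round = rnd + 1
        # Workers that reported start again from the updated model.
        for _, w, _ in arrived:
            heapq.heappush(
                pending, (now + rng.expovariate(1.0), w, next_round))

        # Workers that would exceed the staleness bound are restarted.
        refreshed = []
        for finish, w, start in pending:
            if next_round - start > max_staleness:
                refreshed.append(
                    (now + rng.expovariate(1.0), w, next_round))
            else:
                refreshed.append((finish, w, start))
        pending = refreshed
        heapq.heapify(pending)

    return now, staleness_log
```

Calling `simulate_bounded_staleness(8, 50, 3, 2)` runs 50 rounds with 8 workers, aggregating 3 gradients per round and never accepting a gradient more than 2 rounds old. Setting `wait_for` equal to `num_workers` recovers a fully synchronous scheme in this toy model, which is one way to see the straggling-staleness trade-off the abstract refers to.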