Bibliographic Details
Main authors: Cheng, Ziheng; Glasgow, Margalit
Format: Preprint
Published: 2024
Subjects: Machine Learning; Optimization and Control
Online access: https://arxiv.org/abs/2409.13155
_version_ 1866912230213156864
author Cheng, Ziheng
Glasgow, Margalit
author_facet Cheng, Ziheng
Glasgow, Margalit
contents We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in reducing communication complexity, are not yet fully understood. In this paper, we prove for the first time that Local SGD with momentum (Local SGDM) and Local Adam can outperform their minibatch counterparts in certain regimes, in convex and weakly convex settings respectively. Our analysis relies on a novel technique for proving contraction during local iterations, a crucial yet challenging step in showing the advantages of local updates, under a generalized smoothness assumption and a gradient clipping strategy.
format Preprint
id arxiv_https___arxiv_org_abs_2409_13155
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Convergence of Distributed Adaptive Optimization with Local Updates
Cheng, Ziheng
Glasgow, Margalit
Machine Learning
Optimization and Control
We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in reducing communication complexity, are not yet fully understood. In this paper, we prove for the first time that Local SGD with momentum (Local SGDM) and Local Adam can outperform their minibatch counterparts in certain regimes, in convex and weakly convex settings respectively. Our analysis relies on a novel technique for proving contraction during local iterations, a crucial yet challenging step in showing the advantages of local updates, under a generalized smoothness assumption and a gradient clipping strategy.
title Convergence of Distributed Adaptive Optimization with Local Updates
topic Machine Learning
Optimization and Control
url https://arxiv.org/abs/2409.13155