Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Beznosikov, Aleksandr, Horváth, Samuel, Richtárik, Peter, Safaryan, Mher
Format:	Preprint
Published:	2020
Subjects:	Machine Learning Distributed, Parallel, and Cluster Computing Optimization and Control
Online Access:	https://arxiv.org/abs/2002.12410
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866917566571610112
author	Beznosikov, Aleksandr Horváth, Samuel Richtárik, Peter Safaryan, Mher
author_facet	Beznosikov, Aleksandr Horváth, Samuel Richtárik, Peter Safaryan, Mher
contents	In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings. We prove that distributed compressed SGD method, employed with error feedback mechanism, enjoys the ergodic rate $O\left( δL \exp \left[-\frac{μK}{δL}\right] + \frac{(C + δD)}{Kμ}\right)$, where $δ\ge 1$ is a compression parameter which grows when more compression is applied, $L$ and $μ$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2002_12410
institution	arXiv
publishDate	2020
record_format	arxiv
spellingShingle	On Biased Compression for Distributed Learning Beznosikov, Aleksandr Horváth, Samuel Richtárik, Peter Safaryan, Mher Machine Learning Distributed, Parallel, and Cluster Computing Optimization and Control In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings. We prove that distributed compressed SGD method, employed with error feedback mechanism, enjoys the ergodic rate $O\left( δL \exp \left[-\frac{μK}{δL}\right] + \frac{(C + δD)}{Kμ}\right)$, where $δ\ge 1$ is a compression parameter which grows when more compression is applied, $L$ and $μ$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.
title	On Biased Compression for Distributed Learning
topic	Machine Learning Distributed, Parallel, and Cluster Computing Optimization and Control
url	https://arxiv.org/abs/2002.12410

Similar Items