Staff View: CAFe: Cost and Age aware Federated Learning :: Library Catalog

Bibliographic Details
Main Authors:	Liyanaarachchi, Sahan; Thilakarathna, Kanchana; Ulukus, Sennur
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Distributed, Parallel, and Cluster Computing; Information Theory
Online Access:	https://arxiv.org/abs/2405.15744

_version_	1866910458793951232
author	Liyanaarachchi, Sahan; Thilakarathna, Kanchana; Ulukus, Sennur
author_facet	Liyanaarachchi, Sahan; Thilakarathna, Kanchana; Ulukus, Sennur
contents	In many federated learning (FL) models, a common strategy to ensure progress in the training process is to wait, once the parameter server (PS) has broadcast the global model, for at least $M$ of the $N$ clients to send back their local gradients within a reporting deadline $T$. If fewer than $M$ clients report back within the deadline, the round is considered failed and is restarted from scratch. If enough clients respond, the round is deemed successful and the local gradients of all clients that responded are used to update the global model. In either case, the clients that failed to report an update within the deadline have wasted their computational resources. A tighter deadline (small $T$) combined with a larger required number of participating clients (large $M$) leads to many failed rounds and therefore greater communication cost and wasted computation. However, a larger $T$ leads to longer round durations, whereas a smaller $M$ may lead to noisy gradients. Therefore, the parameters $M$ and $T$ need to be optimized so that communication cost and resource wastage are minimized while maintaining an acceptable convergence rate. In this regard, we show that the average age of a client at the PS appears explicitly in the theoretical convergence bound and can therefore be used as a metric to quantify the convergence of the global model. We provide an analytical scheme to select the parameters $M$ and $T$ in this setting.
format	Preprint
id	arxiv_https___arxiv_org_abs_2405_15744
institution	arXiv
publishDate	2024
record_format	arxiv
title	CAFe: Cost and Age aware Federated Learning
topic	Machine Learning; Distributed, Parallel, and Cluster Computing; Information Theory
url	https://arxiv.org/abs/2405.15744
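
Illustrative note (not part of the catalogued record): the M-of-N deadline protocol described in the contents field can be sketched as a small simulation. This is a minimal sketch, not the authors' analytical scheme. It assumes i.i.d. exponential client response times with a hypothetical parameter `rate`, and it measures "age" as the number of rounds since a client's update was last used, which may differ from the paper's definition. The function names are likewise hypothetical.

```python
import math
import random


def success_probability(n, m, t, rate):
    """P(at least m of n clients report within deadline t).

    Assumes i.i.d. exponential response times (an assumption, not from the record).
    """
    p = 1.0 - math.exp(-rate * t)  # chance that one client meets the deadline
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(m, n + 1))


def simulate(n, m, t, rate, rounds, seed=0):
    """Run the M-of-N deadline protocol; report failures, waste, and average age."""
    rng = random.Random(seed)
    age = [0] * n  # rounds since each client's update was last used (assumed definition)
    failed = 0     # rounds in which fewer than m clients met the deadline
    wasted = 0     # late client computations, counted in every round as in the abstract
    age_sum = 0
    for _ in range(rounds):
        on_time = [rng.expovariate(rate) <= t for _ in range(n)]
        success = sum(on_time) >= m
        if not success:
            failed += 1
        wasted += n - sum(on_time)
        for i in range(n):
            # An update only counts if the client was on time AND the round succeeded.
            age[i] = 0 if (success and on_time[i]) else age[i] + 1
        age_sum += sum(age)
    return {
        "failed_fraction": failed / rounds,
        "wasted_per_round": wasted / rounds,
        "average_age": age_sum / (rounds * n),
    }


if __name__ == "__main__":
    # Hypothetical settings: a tight deadline with large M versus a looser pairing.
    for m, t in [(18, 0.5), (10, 2.0)]:
        print(m, t, success_probability(20, m, t, 1.0), simulate(20, m, t, 1.0, 2000))
```

Sweeping `m` and `t` in such a simulation shows the trade-off the abstract describes: small `t` with large `m` raises the failed-round fraction and the average age, while large `t` lowers both at the price of longer rounds.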

Similar Items