Bibliographic Details
Main Authors: Ding, Yi; Gao, Aijia; Ryden, Thibaud; Sedlak, Michal; Ewaisha, Essam; Marnat, Igor; Hoffmann, Henry
Format: Preprint
Published: 2022
Subjects: Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access: https://arxiv.org/abs/2212.05155
author Ding, Yi
Gao, Aijia
Ryden, Thibaud
Sedlak, Michal
Ewaisha, Essam
Marnat, Igor
Hoffmann, Henry
contents Software upgrades are critical to maintaining server reliability in datacenters. While job duration prediction and scheduling have been extensively studied, the unique challenges posed by software upgrades remain largely under-explored. This paper presents the first in-depth investigation into software upgrade scheduling at datacenter scale. We begin by characterizing various types of upgrades and then frame the scheduling task as a constrained optimization problem. To address this problem, we introduce Acela, a cost-aware duration prediction framework designed to improve upgrade scheduling efficiency and throughput while meeting service-level objectives (SLOs). Acela accounts for asymmetric misprediction costs, strategically selects the best predictive models, and mitigates straggler-induced overestimations. Evaluations on Meta's production datacenter systems demonstrate that Acela significantly increases the efficiency of the existing upgrade scheduler: it improves upgrade window utilization by 1.25X, increases the number of scheduled and completed upgrades by 33% and 41%, respectively, and reduces cancellation rates by 2.4X. The code and data sets will be released after paper acceptance.
format Preprint
id arxiv_https___arxiv_org_abs_2212_05155
institution arXiv
publishDate 2022
record_format arxiv
title Cost-aware Duration Prediction for Software Upgrades in Datacenters
topic Distributed, Parallel, and Cluster Computing
Machine Learning
url https://arxiv.org/abs/2212.05155