MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Xiong, Yifan; Jiang, Yuting; Yang, Ziyue; Qu, Lei; Zhao, Guoshuai; Liu, Shuguang; Zhong, Dong; Pinzur, Boris; Zhang, Jie; Wang, Yang; Jose, Jithin; Pourreza, Hossein; Baxter, Jeff; Datta, Kushal; Ram, Prabhat; Melton, Luke; Chau, Joe; Cheng, Peng; Xiong, Yongqiang; Zhou, Lidong
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2402.06194

_version_	1866910477255180288
author	Xiong, Yifan; Jiang, Yuting; Yang, Ziyue; Qu, Lei; Zhao, Guoshuai; Liu, Shuguang; Zhong, Dong; Pinzur, Boris; Zhang, Jie; Wang, Yang; Jose, Jithin; Pourreza, Hossein; Baxter, Jeff; Datta, Kushal; Ram, Prabhat; Melton, Luke; Chau, Joe; Cheng, Peng; Xiong, Yongqiang; Zhou, Lidong
author_facet	Xiong, Yifan; Jiang, Yuting; Yang, Ziyue; Qu, Lei; Zhao, Guoshuai; Liu, Shuguang; Zhong, Dong; Pinzur, Boris; Zhang, Jie; Wang, Yang; Jose, Jithin; Pourreza, Hossein; Baxter, Jeff; Datta, Kushal; Ram, Prabhat; Melton, Luke; Chau, Joe; Cheng, Peng; Xiong, Yongqiang; Zhou, Lidong
contents	Reliability in cloud AI infrastructure is crucial for cloud service providers, prompting the widespread use of hardware redundancies. However, these redundancies can inadvertently lead to hidden degradation, so-called "gray failure", for AI workloads, significantly affecting end-to-end performance and concealing performance issues, which complicates root cause analysis for failures and regressions. We introduce SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation caused by hardware redundancies and enhances overall reliability. SuperBench features a comprehensive benchmark suite capable of evaluating individual hardware components and representing most real AI workloads. It comprises a Validator that learns benchmark criteria to clearly pinpoint defective components. Additionally, SuperBench incorporates a Selector to balance validation time and issue-related penalties, enabling optimal timing for validation execution with a tailored subset of benchmarks. Through testbed evaluation and simulation, we demonstrate that SuperBench can increase the mean time between incidents by up to 22.61x. SuperBench has been successfully deployed in Azure production, validating hundreds of thousands of GPUs over the last two years.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_06194
institution	arXiv
publishDate	2024
record_format	arxiv
title	SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
topic	Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2402.06194
