Staff View: Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems :: Library Catalog

Bibliographic Details
Main Authors:	Lu, Ning; Xie, Qian; Zhang, Hao; Fang, Wenyi; Zheng, Yang; Hu, Zheng; Ma, Jiantao
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2408.07482

_version_	1866929532801384448
author	Lu, Ning; Xie, Qian; Zhang, Hao; Fang, Wenyi; Zheng, Yang; Hu, Zheng; Ma, Jiantao
author_facet	Lu, Ning; Xie, Qian; Zhang, Hao; Fang, Wenyi; Zheng, Yang; Hu, Zheng; Ma, Jiantao
contents	Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and significant computing time, leading to frequent failures that significantly increase training costs. Despite its significance, this field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_07482
institution	arXiv
publishDate	2024
record_format	arxiv
title	Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems
topic	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
url	https://arxiv.org/abs/2408.07482
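
Note on the TOR definition: the abstract defines TOR as the ratio of optimal training time to observed training time and says it can be used to estimate the actual time required to train on a given system. The following is a minimal Python sketch of that arithmetic only. The function names, the choice of hours as the unit, and the check that TOR lies in (0, 1] are illustrative assumptions and do not come from the paper. The paper's failure-specific TOR equations are not reproduced here.

```python
def training_overhead_ratio(optimal_hours: float, observed_hours: float) -> float:
    """TOR = optimal training time / observed training time (per the abstract).

    Hypothetical helper: the name and the unit (hours) are assumptions.
    """
    if optimal_hours <= 0 or observed_hours <= 0:
        raise ValueError("training times must be positive")
    return optimal_hours / observed_hours


def estimated_training_hours(optimal_hours: float, tor: float) -> float:
    """Invert the definition: observed time = optimal time / TOR.

    Assumes 0 < TOR <= 1, i.e. a system never beats its own optimal time.
    """
    if not 0 < tor <= 1:
        raise ValueError("TOR is assumed to lie in (0, 1]")
    return optimal_hours / tor


# Example: a run that would take 90 hours without failures but took 120 hours.
tor = training_overhead_ratio(90.0, 120.0)
print(tor)                                   # 0.75
print(estimated_training_hours(90.0, tor))   # 120.0
```

Under these assumptions, a TOR closer to 1 means less time lost to failures and recovery, and dividing a failure-free time estimate by a system's TOR gives the expected wall-clock training time.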
