Bibliographic Details
Main Authors: Lu, Ning; Xie, Qian; Zhang, Hao; Fang, Wenyi; Zheng, Yang; Hu, Zheng; Ma, Jiantao
Format: Preprint
Published: 2024
Subjects: Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access: https://arxiv.org/abs/2408.07482
author Lu, Ning
Xie, Qian
Zhang, Hao
Fang, Wenyi
Zheng, Yang
Hu, Zheng
Ma, Jiantao
author_facet Lu, Ning
Xie, Qian
Zhang, Hao
Fang, Wenyi
Zheng, Yang
Hu, Zheng
Ma, Jiantao
contents Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and significant computing time, leading to frequent failures that significantly increase training costs. Despite the significance of this problem, the field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice.
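As a rough illustration of the definition given in the abstract (TOR as optimal training time divided by observed training time), the ratio and its use for estimating actual training time can be sketched as below. This is only a sketch of the stated definition, not the paper's failure-specific TOR equations, which this record does not reproduce. The function names and the example figures (800 and 1000 hours) are hypothetical.

```python
def training_overhead_ratio(optimal_hours: float, observed_hours: float) -> float:
    """TOR = optimal training time / observed training time (per the abstract)."""
    if optimal_hours <= 0 or observed_hours <= 0:
        raise ValueError("training times must be positive")
    return optimal_hours / observed_hours


def estimated_training_time(optimal_hours: float, tor: float) -> float:
    """Invert the definition: observed time = optimal time / TOR."""
    if optimal_hours <= 0 or tor <= 0:
        raise ValueError("optimal time and TOR must be positive")
    return optimal_hours / tor


# Hypothetical figures: a job that would take 800 hours with no failures
# actually took 1000 hours on the system being evaluated.
tor = training_overhead_ratio(800.0, 1000.0)
print(f"TOR: {tor:.2f}")
print(f"Estimated time for an 800-hour job: {estimated_training_time(800.0, tor):.0f} hours")
```

Under this reading, a TOR closer to 1 means the system loses less time to failures, and dividing a failure-free time estimate by a system's TOR gives the expected wall-clock time on that system.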
format Preprint
id arxiv_https___arxiv_org_abs_2408_07482
institution arXiv
publishDate 2024
record_format arxiv
title Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems
topic Distributed, Parallel, and Cluster Computing
Artificial Intelligence
url https://arxiv.org/abs/2408.07482