Bibliographic Details
Main Authors: Liu, Guanliang; Patni, Abhinandan; Lin, Congzhu; Zeng, Zoe; Wittmayer, Jack; Wu, Josh; Nihalani, Ashvin; Huang, Binxuan; Liu, Yinghong; Na, Rory; Ko, Anthony; Zhipa, Alexander; Cheng, Cong; Sun, Mi; Rajakumar, Vijay; Joseph, Rejith George; Govindarajen, Parthasarathy
Format: Preprint
Published: 2026
Subjects: Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
Online Access: https://arxiv.org/abs/2605.17879
author Liu, Guanliang
Patni, Abhinandan
Lin, Congzhu
Zeng, Zoe
Wittmayer, Jack
Wu, Josh
Nihalani, Ashvin
Huang, Binxuan
Liu, Yinghong
Na, Rory
Ko, Anthony
Zhipa, Alexander
Cheng, Cong
Sun, Mi
Rajakumar, Vijay
Joseph, Rejith George
Govindarajen, Parthasarathy
contents Training frontier-scale foundation models involves coordinating tens of thousands of GPUs over multi-month runs, where even minor performance degradations can accumulate into substantial efficiency losses. Existing health-check mechanisms, such as NCCL tests or GPU burn-in, primarily focus on functional correctness and often fail to detect fail-slow behaviors that silently degrade system performance. In this paper, we present Guard, a scalable system for detecting stragglers and ensuring node health in large-scale training clusters. Guard combines lightweight online performance monitoring during training with an offline node-sweep mechanism that systematically evaluates and qualifies nodes before they participate in production workloads. This design enables Guard to detect both acute failures and long-running fail-slow behaviors that traditional diagnostics cannot capture. Deployed on large-scale foundation model pretraining workloads, Guard improves mean FLOPs utilization by up to 1.7x, reduces run-to-run training step variance from 20% to 1%, increases mean time to failure (MTTF), and significantly reduces operational and debugging overhead. These results demonstrate that proactive straggler detection and systematic node qualification are critical for maintaining stable and efficient large-scale training.
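The abstract does not specify how Guard's online monitoring identifies stragglers, so the following is only a minimal sketch of one common approach, not the authors' method. It assumes a hypothetical per-node measurement of local step time (for example, compute time before the collective), and flags any node whose mean exceeds the cluster median by more than a chosen tolerance. The function name `find_stragglers`, the `tolerance` parameter, and the node names are all illustrative assumptions.

```python
from statistics import median


def find_stragglers(step_times, tolerance=0.10):
    """Return the nodes whose mean step time is slower than the median node.

    step_times: dict mapping a node name to a list of per-step durations in
                seconds (a hypothetical measurement, not Guard's actual signal).
    tolerance:  fractional slowdown over the median that counts as a straggler
                (an assumed threshold chosen for illustration).
    """
    # Average each node's samples so a single noisy step does not trigger a flag.
    means = {node: sum(ts) / len(ts) for node, ts in step_times.items() if ts}
    if not means:
        return []

    # The median is robust to a few slow nodes, unlike the mean.
    baseline = median(means.values())

    # Flag only the slow side: nodes faster than the median are not a problem.
    return sorted(node for node, m in means.items()
                  if m > baseline * (1 + tolerance))


samples = {
    "node-a": [1.00, 1.02, 0.98],
    "node-b": [1.01, 0.99, 1.00],
    "node-c": [1.30, 1.28, 1.32],  # about 30% slower than its peers
    "node-d": [1.00, 1.01, 0.99],
}
print(find_stragglers(samples))  # ['node-c']
```

Comparing against the median rather than a fixed time budget keeps the check independent of model size and batch configuration, which is one reason relative thresholds are a common starting point for fail-slow detection.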
format Preprint
id arxiv_https___arxiv_org_abs_2605_17879
institution arXiv
publishDate 2026
record_format arxiv
title Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training
topic Distributed, Parallel, and Cluster Computing
Artificial Intelligence
Machine Learning
url https://arxiv.org/abs/2605.17879