Staff View: Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Guanliang; Patni, Abhinandan; Lin, Congzhu; Zeng, Zoe; Wittmayer, Jack; Wu, Josh; Nihalani, Ashvin; Huang, Binxuan; Liu, Yinghong; Na, Rory; Ko, Anthony; Zhipa, Alexander; Cheng, Cong; Sun, Mi; Rajakumar, Vijay; Joseph, Rejith George; Govindarajen, Parthasarathy
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2605.17879

_version_	1866913140067794944
author	Liu, Guanliang; Patni, Abhinandan; Lin, Congzhu; Zeng, Zoe; Wittmayer, Jack; Wu, Josh; Nihalani, Ashvin; Huang, Binxuan; Liu, Yinghong; Na, Rory; Ko, Anthony; Zhipa, Alexander; Cheng, Cong; Sun, Mi; Rajakumar, Vijay; Joseph, Rejith George; Govindarajen, Parthasarathy
author_facet	Liu, Guanliang; Patni, Abhinandan; Lin, Congzhu; Zeng, Zoe; Wittmayer, Jack; Wu, Josh; Nihalani, Ashvin; Huang, Binxuan; Liu, Yinghong; Na, Rory; Ko, Anthony; Zhipa, Alexander; Cheng, Cong; Sun, Mi; Rajakumar, Vijay; Joseph, Rejith George; Govindarajen, Parthasarathy
contents	Training frontier-scale foundation models involves coordinating tens of thousands of GPUs over multi-month runs, where even minor performance degradations can accumulate into substantial efficiency losses. Existing health-check mechanisms, such as NCCL tests or GPU burn-in, primarily focus on functional correctness and often fail to detect fail-slow behaviors that silently degrade system performance. In this paper, we present Guard, a scalable system for detecting stragglers and ensuring node health in large-scale training clusters. Guard combines lightweight online performance monitoring during training with an offline node-sweep mechanism that systematically evaluates and qualifies nodes before they participate in production workloads. This design enables Guard to detect both acute failures and long-running fail-slow behaviors that traditional diagnostics cannot capture. Deployed on large-scale foundation model pretraining workloads, Guard improves mean FLOPs utilization by up to 1.7x, reduces run-to-run training step variance from 20% to 1%, increases mean time to failure (MTTF), and significantly reduces operational and debugging overhead. These results demonstrate that proactive straggler detection and systematic node qualification are critical for maintaining stable and efficient large-scale training.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_17879
institution	arXiv
publishDate	2026
record_format	arxiv
title	Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training
topic	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2605.17879
