Bibliographic Details
Main Authors: He, Zilong; Chen, Pengfei; Zhang, Hongyu; Li, Xiaoyun; Yu, Guangba; Chen, Hongyang; Zheng, Zibin
Format: Preprint
Published: 2025
Subjects: Software Engineering
Online Access: https://arxiv.org/abs/2507.01628
author He, Zilong
Chen, Pengfei
Zhang, Hongyu
Li, Xiaoyun
Yu, Guangba
Chen, Hongyang
Zheng, Zibin
contents Deep learning (DL) systems have been widely adopted in many areas, and are becoming even more popular with the emergence of large language models. However, due to the complex software stacks involved in their development and execution, crashes are unavoidable and common. Crashes severely waste computing resources and hinder development productivity, so efficient crash recovery is crucial. Existing solutions, such as checkpoint-retry, are too heavyweight for fast recovery from crashes caused by minor programming errors or transient runtime errors. Therefore, we present DaiFu, an in-situ recovery framework for DL systems. Through a lightweight code transformation to a given DL system, DaiFu augments it to intercept crashes in situ and enables dynamic and instant updates to its program running context (e.g., code, configurations, and other data) for agile crash recovery. Our evaluation shows that DaiFu helps reduce the restore time for crash recovery, achieving a 1372x speedup compared with state-of-the-art solutions. Meanwhile, the overhead of DaiFu is negligible (under 0.40%). We also construct a benchmark spanning 7 distinct crash scenarios in DL systems, and show the effectiveness of DaiFu in diverse situations.
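The idea of intercepting a crash in place and updating the running context before retrying can be illustrated with a minimal Python sketch. This is not DaiFu's actual code transformation or API, which this record does not describe; the names `run_in_situ`, `train_step`, and `halve_batch_size` are hypothetical, and the out-of-memory failure is simulated.

```python
from typing import Any, Callable, Dict


def run_in_situ(step: Callable[[Dict[str, Any]], Any],
                context: Dict[str, Any],
                repair: Callable[[Exception, Dict[str, Any]], None],
                max_repairs: int = 3) -> Any:
    """Run `step`; on a crash, let `repair` patch `context` in place and retry.

    Hypothetical helper: the process and its in-memory state stay alive,
    so no checkpoint has to be reloaded between attempts.
    """
    for attempt in range(max_repairs + 1):
        try:
            return step(context)
        except Exception as exc:
            if attempt == max_repairs:
                raise  # give up after the allowed number of repairs
            repair(exc, context)  # update the running context, then retry


def train_step(ctx: Dict[str, Any]) -> int:
    # Hypothetical training step that "crashes" while the batch is too large.
    if ctx["batch_size"] > 32:
        raise MemoryError("simulated out-of-memory")
    ctx["steps_done"] += 1
    return ctx["steps_done"]


def halve_batch_size(exc: Exception, ctx: Dict[str, Any]) -> None:
    # Hypothetical repair: adjust a configuration value in the live context.
    ctx["batch_size"] //= 2


context = {"batch_size": 128, "steps_done": 5}
result = run_in_situ(train_step, context, halve_batch_size)
```

In this sketch the step counter held in memory survives both simulated crashes, which is the property that distinguishes in-place recovery from restarting the process and restoring a checkpoint.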
format Preprint
id arxiv_https___arxiv_org_abs_2507_01628
institution arXiv
publishDate 2025
record_format arxiv
title DaiFu: In-Situ Crash Recovery for Deep Learning Systems
topic Software Engineering
url https://arxiv.org/abs/2507.01628