Bibliographic Details
Main Authors: Mao, Jiayi, Li, Liqun, Gao, Yanjie, Peng, Zegang, He, Shilin, Zhang, Chaoyun, Qin, Si, Khalid, Samia, Lin, Qingwei, Rajmohan, Saravan, Lanka, Sitaram, Zhang, Dongmei
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2510.10074
_version_ 1866915946612916224
author Mao, Jiayi
Li, Liqun
Gao, Yanjie
Peng, Zegang
He, Shilin
Zhang, Chaoyun
Qin, Si
Khalid, Samia
Lin, Qingwei
Rajmohan, Saravan
Lanka, Sitaram
Zhang, Dongmei
contents Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist site reliability engineers (SREs) in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution directed acyclic graphs (DAGs) from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to ensure correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs. Our code and sample data are publicly available at https://github.com/microsoft/StepFly.
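The third stage described above (a DAG-guided scheduler-executor that runs independent steps in parallel) can be illustrated with a minimal sketch. This is not StepFly's implementation: the function `run_dag`, the step names, and the plain dictionary standing in for the memory system are all hypothetical, and the sketch assumes Python 3.9+ for `graphlib`.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(steps, deps, max_workers=4):
    """Run each step once all of its predecessors finish.

    steps: dict mapping step name -> callable(memory) returning a result
    deps:  dict mapping step name -> set of predecessor step names
    Both arguments are hypothetical shapes chosen for this sketch.
    """
    memory = {}  # results of finished steps (stand-in for a memory system)
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in steps})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    pending = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Every step whose predecessors are done is submitted at once,
            # so independent steps overlap in time.
            for name in sorter.get_ready():
                pending[pool.submit(steps[name], memory)] = name
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                name = pending.pop(future)
                memory[name] = future.result()
                sorter.done(name)  # unlocks the successors of this step
    return memory


# Hypothetical troubleshooting steps: two queries depend on the same alert
# check and can run concurrently; the diagnosis waits for both.
steps = {
    "check_alert": lambda m: "high_latency",
    "query_cpu": lambda m: 91,
    "query_errors": lambda m: 3,
    "diagnose": lambda m: "cpu_saturation" if m["query_cpu"] > 80 else "unknown",
}
deps = {
    "query_cpu": {"check_alert"},
    "query_errors": {"check_alert"},
    "diagnose": {"query_cpu", "query_errors"},
}
result = run_dag(steps, deps)
```

Here `query_cpu` and `query_errors` become ready together after `check_alert` completes, which is the kind of step-level parallelism the abstract attributes to the scheduler.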
format Preprint
id arxiv_https___arxiv_org_abs_2510_10074
institution arXiv
publishDate 2025
record_format arxiv
title StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis
topic Artificial Intelligence
url https://arxiv.org/abs/2510.10074