Staff View: Enhancing reliability in AI inference services: An empirical study on real production incidents :: Library Catalog

Bibliographic Details
Main Authors:	Ranganathan, Bhala; Zhang, Mickey; Wu, Kai
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Computers and Society
Online Access:	https://arxiv.org/abs/2511.07424

_version_	1866915610465665024
author	Ranganathan, Bhala; Zhang, Mickey; Wu, Kai
author_facet	Ranganathan, Bhala; Zhang, Mickey; Wu, Kai
contents	Hyperscale large language model (LLM) inference places extraordinary demands on cloud systems, where even brief failures can translate into significant user and business impact. To better understand and mitigate these risks, we present one of the first provider-internal, practice-based analyses of LLM inference incidents. We developed a taxonomy and methodology grounded in a year of operational experience, validated it on 156 high-severity incidents, and conducted a focused quantitative study of Apr-Jun 2025 to ensure recency and relevance. Our approach achieves high labeling consistency (Cohen's κ ~0.89), identifies dominant failure modes (in our dataset, ~60% inference engine failures; within that category, ~40% timeouts), and surfaces mitigation levers (~74% auto-detected; ~28% required a hotfix). Beyond hotfixes, many incidents were mitigated via traffic routing, node rebalancing, or capacity increase policies, indicating further automation opportunities. We also show how the taxonomy guided targeted strategies such as connection liveness, GPU capacity-aware routing, and per-endpoint isolation, which reduced incident impact and accelerated recovery. Finally, we contribute a practitioner-oriented adoption checklist that enables others to replicate our taxonomy, analysis, and automation opportunities in their own systems. This study demonstrates how systematic, empirically grounded analysis of inference operations can drive more reliable and cost-efficient LLM serving at scale.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_07424
institution	arXiv
publishDate	2025
record_format	arxiv
title	Enhancing reliability in AI inference services: An empirical study on real production incidents
topic	Distributed, Parallel, and Cluster Computing; Computers and Society
url	https://arxiv.org/abs/2511.07424
