Bibliographic Details
Main Authors: Ranganathan, Bhala, Zhang, Mickey, Wu, Kai
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2511.07424
author Ranganathan, Bhala
Zhang, Mickey
Wu, Kai
contents Hyperscale large language model (LLM) inference places extraordinary demands on cloud systems, where even brief failures can translate into significant user and business impact. To better understand and mitigate these risks, we present one of the first provider-internal, practice-based analyses of LLM inference incidents. We developed a taxonomy and methodology grounded in a year of operational experience, validated it on 156 high-severity incidents, and conducted a focused quantitative study of Apr-Jun 2025 to ensure recency and relevance. Our approach achieves high labeling consistency (Cohen's kappa ~0.89), identifies dominant failure modes (in our dataset, ~60% of incidents were inference engine failures, and ~40% of those were timeouts), and surfaces mitigation levers (~74% of incidents were auto-detected; ~28% required a hotfix). Beyond hotfixes, many incidents were mitigated via traffic routing, node rebalancing, or capacity increase policies, indicating further automation opportunities. We also show how the taxonomy guided targeted strategies such as connection liveness, GPU capacity-aware routing, and per-endpoint isolation, which reduced incident impact and accelerated recovery. Finally, we contribute a practitioner-oriented adoption checklist that enables others to replicate our taxonomy, analysis, and automation opportunities in their own systems. This study demonstrates how systematic, empirically grounded analysis of inference operations can drive more reliable and cost-efficient LLM serving at scale.
format Preprint
id arxiv_https___arxiv_org_abs_2511_07424
institution arXiv
publishDate 2025
record_format arxiv
title Enhancing reliability in AI inference services: An empirical study on real production incidents
topic Distributed, Parallel, and Cluster Computing
Computers and Society
url https://arxiv.org/abs/2511.07424
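
Note on the labeling-consistency figure in the abstract: Cohen's kappa measures how often two raters assign the same category beyond what chance alone would produce, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is only an illustration of that standard formula; the function name, the category names, and the sample labels are hypothetical and are not taken from the paper's data or tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(labels_a)

    # p_o: fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: chance agreement, from each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up incident labels from two hypothetical reviewers.
rater_a = ["engine", "engine", "network", "capacity", "engine", "network"]
rater_b = ["engine", "engine", "network", "engine", "engine", "network"]
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Values near 1 indicate near-perfect agreement, and a value of 0 indicates agreement no better than chance.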