Bibliographic Details
Main Authors: Liu, Dewei, He, Chuan, Peng, Xin, Lin, Fan, Zhang, Chenxi, Gong, Shengfang, Li, Ziang, Ou, Jiayu, Wu, Zheshun
Format: Preprint
Published: 2021
Subjects: Software Engineering
Online Access: https://arxiv.org/abs/2103.01782
author Liu, Dewei
He, Chuan
Peng, Xin
Lin, Fan
Zhang, Chenxi
Gong, Shengfang
Li, Ziang
Ou, Jiayu
Wu, Zheshun
contents Availability issues in industrial microservice systems (e.g., a drop in successfully placed orders or processed transactions) directly affect business operations. These issues are usually caused by various types of service anomalies that propagate along service dependencies. Accurate and efficient root cause localization is thus a critical challenge for large-scale industrial microservice systems. Existing approaches locate root causes automatically by analyzing a service dependency graph. However, they are limited by inaccurate detection of service anomalies and inefficient traversal of the service dependency graph. In this paper, we propose MicroHECL, a highly efficient root cause localization approach for availability issues of microservice systems. Based on a dynamically constructed service call graph, MicroHECL analyzes possible anomaly propagation chains and ranks candidate root causes using correlation analysis. We combine machine learning and statistical methods and design customized models for detecting different types of service anomalies (i.e., performance, reliability, and traffic). To improve efficiency, we adopt a pruning strategy that eliminates irrelevant service calls during anomaly propagation chain analysis. Experimental studies show that MicroHECL significantly outperforms two state-of-the-art baseline approaches in both accuracy and efficiency. MicroHECL has been used at Alibaba, where it achieves a top-3 hit ratio of 68% and reduces root cause localization time from 30 minutes to 5 minutes.
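The abstract names three steps: walking anomaly propagation chains on a service call graph, pruning irrelevant calls, and ranking candidates by correlation. The Python sketch below illustrates that general idea only; it is not the authors' implementation. The function names (`pearson`, `localize`), the `min_corr` threshold of 0.7, the choice of Pearson correlation as the pruning test, the caller-to-callee traversal direction, and the service names and metric values are all assumptions made for illustration. Anomaly detection is taken as given (the `anomalous` set), and the call graph is assumed to be acyclic.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def localize(calls, metrics, anomalous, start, min_corr=0.7):
    """Follow calls from `start` to anomalous callees, pruning weakly
    correlated calls, then rank the services where a chain ends."""
    candidates, stack, seen = [], [start], {start}
    while stack:
        svc = stack.pop()
        # Pruning (assumed rule): keep a call only if the callee is anomalous
        # and its metric moves together with the caller's metric.
        followed = [
            callee for callee in calls.get(svc, [])
            if callee in anomalous
            and abs(pearson(metrics[svc], metrics[callee])) >= min_corr
        ]
        if not followed:
            candidates.append(svc)  # the chain ends here: candidate root cause
        for callee in followed:
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    # Rank candidates by correlation with the service where the issue was seen.
    scored = [(svc, abs(pearson(metrics[start], metrics[svc]))) for svc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical call graph and metric series (made-up values).
calls = {"gateway": ["orders", "search"], "orders": ["payments", "inventory"]}
metrics = {
    "gateway":   [1, 1, 2, 5, 9],
    "orders":    [1, 1, 2, 6, 9],
    "payments":  [1, 1, 3, 6, 10],
    "inventory": [5, 1, 4, 1, 5],   # anomalous, but unrelated to its caller
    "search":    [2, 1, 2, 4, 6],
}
anomalous = {"gateway", "orders", "payments", "inventory", "search"}

ranked = localize(calls, metrics, anomalous, start="gateway")
for service, score in ranked:
    print(f"{service}: {score:.3f}")
```

In this made-up data, the call from `orders` to `inventory` is pruned because the two series are weakly correlated, so `inventory` never becomes a candidate even though it is flagged as anomalous.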
format Preprint
id arxiv_https___arxiv_org_abs_2103_01782
institution arXiv
publishDate 2021
record_format arxiv
title MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems
topic Software Engineering
url https://arxiv.org/abs/2103.01782