Staff View: Mutiny! How does Kubernetes fail, and what can we do about it?

Bibliographic Details
Main Authors:	Barletta, Marco; Cinque, Marcello; Di Martino, Catello; Kalbarczyk, Zbigniew T.; Iyer, Ravishankar K.
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2404.11169

_version_	1866913501025402880
author	Barletta, Marco; Cinque, Marcello; Di Martino, Catello; Kalbarczyk, Zbigniew T.; Iyer, Ravishankar K.
author_facet	Barletta, Marco; Cinque, Marcello; Di Martino, Catello; Kalbarczyk, Zbigniew T.; Iyer, Ravishankar K.
contents	In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of studies on systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the data stored can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/overprovisioning (24%). Errors in the fields tracking dependencies between objects caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.
format	Preprint
id	arxiv_https___arxiv_org_abs_2404_11169
institution	arXiv
publishDate	2024
record_format	arxiv
title	Mutiny! How does Kubernetes fail, and what can we do about it?
topic	Distributed, Parallel, and Cluster Computing
url	https://arxiv.org/abs/2404.11169
