Saved in:
Bibliographic Details
Main Authors: Oliveira, Wendell, Paiva, Filipe, Pereira, Thiago Emmanuel, Brunet, João
Format: Preprint
Published: 2025
Subjects:
Online Access:https://arxiv.org/abs/2505.01568
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866916798564139008
author Oliveira, Wendell
Paiva, Filipe
Pereira, Thiago Emmanuel
Brunet, João
author_facet Oliveira, Wendell
Paiva, Filipe
Pereira, Thiago Emmanuel
Brunet, João
contents Background: As Infrastructure as Code (IaC) becomes standard practice, ensuring the reliability of IaC scripts is essential. Defect taxonomies are valuable tools for this, offering a common language for issues and enabling systematic tracking. A significant prior study developed such a taxonomy, but based it exclusively on the declarative language Puppet. It remained unknown whether this taxonomy applies to programming language-based IaC (PL-IaC) tools like Pulumi, Terraform CDK, and AWS CDK. Aim: We replicated this foundational work to assess the generalizability of the taxonomy across a broader and more diverse landscape. Method: We performed qualitative analysis on 3,364 defect-related commits from 285 open-source PL-IaC repositories (PIPr dataset) to derive a PL-IaC-specific defect taxonomy. We then enhanced the ACID tool, originally developed for the prior study, to automatically classify and analyze defect distributions across an expanded dataset-447 open-source repositories and 94 proprietary projects from VTEX (e-commerce) and Nubank (financial). Results: Our research confirmed the same eight defect categories identified in the original study, with idempotency and security defects appearing infrequently but persistently across projects. Configuration Data defects maintain high frequency in both open-source and proprietary codebases. Conclusions: Our replication supports the generalizability of the original taxonomy, suggesting IaC development challenges surpass organizational boundaries. Configuration Data defects emerge as a persistent high-frequency problem, while idempotency and security defects remain important concerns despite lower frequency. These patterns appear consistent across open-source and proprietary projects, indicating they are fundamental to the IaC paradigm itself, transcending specific tools or project types.
format Preprint
id arxiv_https___arxiv_org_abs_2505_01568
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle A Defect Taxonomy for Infrastructure as Code: A Replication Study
Oliveira, Wendell
Paiva, Filipe
Pereira, Thiago Emmanuel
Brunet, João
Software Engineering
Background: As Infrastructure as Code (IaC) becomes standard practice, ensuring the reliability of IaC scripts is essential. Defect taxonomies are valuable tools for this, offering a common language for issues and enabling systematic tracking. A significant prior study developed such a taxonomy, but based it exclusively on the declarative language Puppet. It remained unknown whether this taxonomy applies to programming language-based IaC (PL-IaC) tools like Pulumi, Terraform CDK, and AWS CDK. Aim: We replicated this foundational work to assess the generalizability of the taxonomy across a broader and more diverse landscape. Method: We performed qualitative analysis on 3,364 defect-related commits from 285 open-source PL-IaC repositories (PIPr dataset) to derive a PL-IaC-specific defect taxonomy. We then enhanced the ACID tool, originally developed for the prior study, to automatically classify and analyze defect distributions across an expanded dataset-447 open-source repositories and 94 proprietary projects from VTEX (e-commerce) and Nubank (financial). Results: Our research confirmed the same eight defect categories identified in the original study, with idempotency and security defects appearing infrequently but persistently across projects. Configuration Data defects maintain high frequency in both open-source and proprietary codebases. Conclusions: Our replication supports the generalizability of the original taxonomy, suggesting IaC development challenges surpass organizational boundaries. Configuration Data defects emerge as a persistent high-frequency problem, while idempotency and security defects remain important concerns despite lower frequency. These patterns appear consistent across open-source and proprietary projects, indicating they are fundamental to the IaC paradigm itself, transcending specific tools or project types.
title A Defect Taxonomy for Infrastructure as Code: A Replication Study
topic Software Engineering
url https://arxiv.org/abs/2505.01568