Bibliographic Details
Main Author: Auch, Maximilian
Format: Digital resource
Language:
Published: Zenodo 2025
Subjects: GitHub; Design Decisions; Java Library Usage
Online Access: https://doi.org/10.5281/zenodo.17846204
_version_ 1866902040245960704
author Auch, Maximilian
author_facet Auch, Maximilian
contents <h2><strong>MATILDA Design Decision Dataset (2010–2023)</strong></h2> <p>This dataset comprises historical development data and design decisions from the Java ecosystem, extracted from publicly available GitHub repositories. The focus lies on the evolution of software dependencies and the identification of technology migrations based on build configuration files (Maven pom.xml).</p> <p><strong>Data Basis and Scope</strong><br>The dataset covers the period from 2010 to 2023 and is based on an analysis of approximately 180,000 software projects. The data basis includes:</p> <ul> <li>3.1 million revisions with complete version history.</li> <li>25.7 million analyzed software components, classified using RNN-based methods.</li> <li>114,202 software projects from which valid design decisions could be extracted.</li> </ul> <p><strong>Extracted Decisions</strong><br>By comparing revision states, changes in the technology stack were identified and categorized:</p> <ul> <li>1.55 million design decisions in total (adding or removing libraries).</li> <li>136,472 migration decisions (8.8%), where a technology was replaced by a functional alternative.</li> <li>74 library categories, with a special focus on databases, application servers, UI frameworks, and messaging systems.</li> </ul> <p><strong>Structure and Formats</strong><br>The data is available in three processing stages:</p> <ul> <li>Raw Data (MongoDB): Complete history including branches and README files.</li> <li>Relational Data (PostgreSQL): Normalized design and migration decisions.</li> <li>Graph Data (Neo4j): Modeling of 2.5 million revision nodes and their relationships to 140 technologies for analyzing migration paths.</li> </ul> <p><strong>Application Areas</strong><br>The dataset is suitable for empirical software engineering research, particularly for analyzing technology trends, investigating library migrations, and training recommendation systems in the field of software architecture.</p>
format Digital resource
id zenodo_https___doi_org_10_5281_zenodo_17846204
institution Zenodo
language
publishDate 2025
publisher Zenodo
record_format zenodo
title MATILDA: Crawled and preprocessed Data to identify library-related decision alternatives
topic GitHub
Design Decisions
Java Library Usage
url https://doi.org/10.5281/zenodo.17846204
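The abstract describes identifying design decisions by comparing revision states of Maven pom.xml files, and treating a removal paired with the addition of a functional alternative as a migration. The following is a minimal sketch of that idea, not the dataset author's pipeline. The function names, the sample POM strings, and the `CATEGORY_OF` lookup table are hypothetical. The dataset itself classifies components with RNN-based methods, which this sketch replaces with a hand-written dictionary.

```python
import xml.etree.ElementTree as ET

# Maven POM files declare all elements in this XML namespace.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def dependencies(pom_xml):
    """Return the set of 'groupId:artifactId' coordinates declared in a POM string."""
    root = ET.fromstring(pom_xml)
    coords = set()
    # Only direct <dependencies> of <project> are read (no dependencyManagement).
    for dep in root.findall("./m:dependencies/m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS).strip()
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS).strip()
        coords.add(f"{group}:{artifact}")
    return coords


def diff_revisions(old_pom, new_pom, category_of):
    """Compare two revisions: return (added, removed, migrations).

    A migration is assumed here to be a removed library paired with an added
    library of the same category. This is an illustrative rule only.
    """
    old, new = dependencies(old_pom), dependencies(new_pom)
    added, removed = sorted(new - old), sorted(old - new)
    migrations = [
        (r, a)
        for r in removed
        for a in added
        if category_of.get(r) is not None and category_of.get(r) == category_of.get(a)
    ]
    return added, removed, migrations


def make_pom(*coords):
    """Build a tiny POM string for the example (hypothetical helper)."""
    deps = "".join(
        f"<dependency><groupId>{g}</groupId><artifactId>{a}</artifactId></dependency>"
        for g, a in coords
    )
    return (
        '<project xmlns="http://maven.apache.org/POM/4.0.0">'
        f"<dependencies>{deps}</dependencies></project>"
    )


# Hypothetical category table standing in for the dataset's classifier.
CATEGORY_OF = {
    "mysql:mysql-connector-java": "database",
    "org.postgresql:postgresql": "database",
    "junit:junit": "testing",
}

OLD_POM = make_pom(("mysql", "mysql-connector-java"), ("junit", "junit"))
NEW_POM = make_pom(("org.postgresql", "postgresql"), ("junit", "junit"))

added, removed, migrations = diff_revisions(OLD_POM, NEW_POM, CATEGORY_OF)
print(added, removed, migrations)
```

In this example the unchanged `junit:junit` dependency produces no decision, while the MySQL driver removal and PostgreSQL driver addition share a category and are therefore reported as one migration pair.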