Bibliographic Details
Main Author: Anonymous
Format: Digital resource
Published: Zenodo 2025
Online Access:https://doi.org/10.5281/zenodo.15544258
# History Alterations - Replication Package

This repository contains the complete replication package for the research article *Altered Histories in Version Control System Repositories: Evidence from the Trenches*. The package provides tools to detect, analyze, and categorize Git history alterations across software repositories, along with Jupyter notebooks to reproduce the analysis presented in the paper.

## Table of Contents

- [Overview](#overview)
- [Repository Structure](#repository-structure)
- [Quick Start](#quick-start)
- [Reproducing the Analysis](#reproducing-the-analysis)
- [Data](#data)
- [Tools Description](#tools-description)
- [Requirements](#requirements)
- [Citation](#citation)

## Overview

This replication package enables researchers to reproduce the analysis of altered Git histories in software repositories archived by [Software Heritage](https://www.softwareheritage.org/). The study investigates how and why Git histories are modified over time, providing insights into developer practices and repository maintenance patterns.

**Main Research Questions:**

- How prevalent are Git history alterations in open-source repositories?
- What types of changes are most commonly made to Git histories?
- What are the root causes of these alterations?
- How do these practices vary across different types of repositories?

## Repository Structure

```
├── README.md                          # This file
├── data/                              # Pre-computed datasets
│   ├── ...
├── altered-history/                   # Main analysis tool
│   ├── src/                           # Rust source code
│   ├── notebooks/                     # Analysis notebooks
│   │   ├── analysis.ipynb             # Main analysis notebook
│   │   ├── build_analysis_dataset.ipynb
│   │   └── utils_analysis.py          # Analysis utilities
│   └── README.md
├── git-historian/                     # History checking tool
│   ├── src/                           # Rust source code
│   └── README.md
├── modified-files/                    # File modification analysis tool
│   ├── src/                           # Rust source code
│   ├── notebooks/                     # Additional analysis notebooks
│   │   ├── license_analysis.ipynb
│   │   ├── license_categorization.py
│   │   ├── secret-analysis.ipynb
│   │   └── swh_license_files.py
│   └── README.md
```

## Quick Start

### Prerequisites

- **Rust** (latest stable version)
- **Python 3.8+** with Jupyter
- **PostgreSQL** (for database operations)
- **Git** (for repository analysis)

### Installation

1. **Clone the repository:**

   ```bash
   git clone <repository-url>
   cd altered-histories-tool-replication-pkg
   ```

2. **Unzip all directories**

3. **Install Python dependencies:**

   ```bash
   pip install pandas matplotlib seaborn jupyter plotly numpy
   ```

4. **Build the Rust tools (optional, for dataset generation):**

   ```bash
   cd altered-history && cargo build --release && cd ..
   cd git-historian && cargo build --release && cd ..
   cd modified-files && cargo build --release && cd ..
   ```

## Reproducing the Analysis

### Option 1: Using Pre-computed Data (Recommended)

The `data/` directory contains pre-computed datasets that allow you to reproduce all analyses without running the computationally intensive data collection process.

1. **Open the main analysis notebook:**

   ```bash
   cd altered-history/notebooks
   jupyter notebook analysis.ipynb
   ```

2. **Run all cells** to reproduce the complete analysis.

3. **Explore additional analyses:**

   Modify notebooks at will to explore the dataframe.

   ```bash
   # Build analysis dataset (shows data preparation)
   jupyter notebook build_analysis_dataset.ipynb

   # License-related analysis
   cd ../../modified-files/notebooks
   jupyter notebook license_analysis.ipynb

   # Security and secrets analysis
   jupyter notebook secret-analysis.ipynb
   ```

### Option 2: Regenerating the Dataset

To reproduce the complete data collection and analysis pipeline:

1. **Download Software Heritage datasets** (see individual tool READMEs)
2. **Configure database connections** in each tool
3. **Run the analysis pipeline** following the step-by-step instructions in each tool's README
4. **Process results** using the provided notebooks

**Note:** Complete dataset regeneration requires significant computational resources and time (potentially weeks for large datasets).

## Data

The `data/` directory contains several key datasets including:

- **`res.pkl`**: Main analysis results containing categorized alterations
- **`stars_without_dup.pkl`**: Repository popularity metrics (GitHub stars)
- **`visit_type.pkl`**: Classification of repository visit patterns
- **`altered_histories_2024_08_23.dump`**: PostgreSQL database dump for git-historian tool

## Tools Description

### 1. altered-history

**Purpose:** Detects and categorizes Git history alterations in Software Heritage archives.

**Key Features:**

- Three-step analysis pipeline (detection → root cause → categorization)
- Parallel processing for large datasets
- Comprehensive alteration taxonomy

**Usage:** See `altered-history/README.md` for detailed instructions.

### 2. git-historian

**Purpose:** Checks individual repositories against the database of known alterations.

**Key Features:**

- PostgreSQL integration
- Git hook integration for automated checking
- Caching system for performance

**Usage:** See `git-historian/README.md` for detailed instructions.

### 3. modified-files

**Purpose:** Analyzes file-level modifications and their patterns.

**Key Features:**

- File modification tracking
- License and security analysis
- Integration with Software Heritage graph

**Usage:** See `modified-files/README.md` for detailed instructions.

## Requirements

### System Requirements

- **Memory:** Minimum 16GB RAM (1.5TB+ recommended for full dataset processing)
- **Storage:** 600GB+ free space for complete datasets
- **CPU:** Multi-core processor recommended for parallel processing

## Reproducibility Notes

1. **Deterministic Results:** The analysis notebooks will produce identical results when run with the provided datasets.

2. **Versioning:** All tools are pinned to specific versions to ensure reproducibility.

3. **Random Seeds:** Where applicable, random seeds are fixed in the analysis code.
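
Before opening the notebooks, it can help to confirm that the pre-computed files listed in the Data section are present and that the disk has enough headroom for a full run. The following is a minimal sketch and not part of the package: the helper names (`missing_files`, `has_free_space`, `summarize_pickle`) are hypothetical, the `data/` location relative to the working directory is an assumption, and "600GB+" is read as decimal gigabytes. The layout of each pickle is not documented here, so the sketch only reports the type and length of the loaded object. Loading the real `.pkl` files most likely requires `pandas` to be installed, since the README describes them as dataframes.

```python
import pickle
import shutil
from pathlib import Path

# File names come from the Data section; everything else is illustrative.
EXPECTED_FILES = (
    "res.pkl",
    "stars_without_dup.pkl",
    "visit_type.pkl",
    "altered_histories_2024_08_23.dump",
)
# "600GB+" from System Requirements, interpreted here as decimal gigabytes.
FULL_RUN_FREE_BYTES = 600 * 10**9


def missing_files(data_dir):
    """Return the expected dataset files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def has_free_space(path, required_bytes=FULL_RUN_FREE_BYTES):
    """Return True if the filesystem holding path has required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


def summarize_pickle(path):
    """Load one pickle and return (type name, length or None)."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)  # only unpickle files from a trusted source
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


if __name__ == "__main__":
    data_dir = Path("data")  # assumed location relative to the package root
    print("missing files:", missing_files(data_dir) or "none")
    print("enough space for full regeneration:", has_free_space("."))
    for name in EXPECTED_FILES:
        candidate = data_dir / name
        if candidate.suffix == ".pkl" and candidate.is_file():
            print(name, summarize_pickle(candidate))
```

The free-space check only matters for Option 2 (regenerating the dataset); reproducing the analysis from the pre-computed data needs just the files themselves.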