| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2026 |
| Online access: | https://doi.org/10.5281/zenodo.20327574 |
Table of contents:
- <h1>Replication Package</h1>
<h2>On the Accuracy of Issue Localization by Coding Agents</h2>
<h2>1. Overview</h2>
<p>We empirically evaluate how accurately three open-weight<br>
LLM-based coding agents identify the software entities that<br>
human developers modify to resolve real-world issues. The<br>
study spans 10 Apache Java projects and 2,441 issues, with<br>
metrics computed at three granularities: package, class, and<br>
method.</p>
<p>The pipeline reconstructs each issue's historical pre-fix<br>
state, invokes each agent on the issue description, and<br>
compares the agent-modified entities against the human ground<br>
truth using Precision, Recall, F1, Accuracy, and MCC.</p>
<h2>2. Repository Structure</h2>
<pre><code>.
├── README.md  # this file
├── requirements.txt  # Python dependencies
├── data/
│   ├── raw/  # SQuaD dataset (input CSV files)
│   ├── interim/  # filtered issues per project (parquet)
│   │   ├── study_projects.csv  # List of the 10 selected projects
│   │   ├── pilot_issues__<project>.parquet  # Initial pool of issues per project (before size filtering)
│   │   └── pilot_final__<project>.parquet  # Sampled issues per project (after size filtering)
│   ├── processed/
│   │   ├── ground_truth/  # human-fix entities per issue
│   │   │   └── <project>/
│   │   │       └── <issue>.json  # Package/class/method entities modified by the human in the HIFC
│   │   ├── agent_outputs/  # raw patch.diff + metadata per (model, issue)
│   │   │   └── <model>/
│   │   │       └── <project>__<issue>/
│   │   │           ├── patch.diff  # Generated patch
│   │   │           ├── run_metadata.json  # Timing, exit code, timeout flag
│   │   │           ├── stdout.log  # OpenCode standard output
│   │   │           └── stderr.log  # OpenCode standard error
│   │   ├── agent_entities/  # Parsed entities from each agent patch (output of stage 07)
│   │   │   └── <model>/
│   │   │       └── <project>/
│   │   │           └── <issue>.json  # Package/class/method entities modified by the agent
│   │   └── metrics/  # P/R/F1 per (model, project, granularity)
│   │       ├── _common/  # Issue keys completed by all three models for a given project (output of stage 09)
│   │       │   ├── common_issues__<project>.csv
│   │       │   └── all_metrics.csv  # Union of all per-issue metrics across every model, project, and granularity
│   │       ├── _global/  # Final aggregated results (output of stage 10)
│   │       │   ├── per_model.csv  # Macro-averaged P/R/F1/Acc per (model, granularity)
│   │       │   ├── per_project.csv  # Per-project breakdown
│   │       │   └── cross.csv  # Cross-model averages
│   │       └── <model>/  # Per-model metrics (one folder per LLM)
│   │           ├── per_issue__<project>.csv  # Per-issue metrics computed by stage 08 (all valid runs)
│   │           └── per_issue_filtered__<project>.csv  # Subset restricted to issues completed by all three models
│   ├── repos/  # cloned Apache repositories
│   │   └── apache#<project>/
│   └── workspaces/  # transient git worktrees (created at runtime)
├── scripts/
│   ├── 00_download_csv_SQuaD.sh  # Download SQuaD source files
│   ├── 01_filter_and_link.py  # Build the DuckDB database from SQuaD files
│   ├── 02_clone_repos.py  # Clone the Apache repositories listed in data/interim/study_projects.csv
│   ├── 03_select_pilot_subset.py  # Select the issues for each project and create pilot_issues__<project>.parquet
│   ├── 04_extract_ground_truth.py  # For each issue in pilot_issues__<project>.parquet, compute the human ground truth
│   ├── 04b_filter_pilot_for_size.py  # [Not used] Filters selected issues and creates pilot_final__<project>.parquet
│   ├── 05_run_one_issue.py  # Manual single-shot agent run for one (project, issue) pair
│   ├── 06_run_pilot.py  # Orchestrate the agent runs over all issues in pilot_final__<project>.parquet
│   ├── 07_extract_aifc_entities.py  # Parse agent patch with tree-sitter
│   ├── 08_compute_metrics.py  # Compute TP/FP/FN/TN, P/R/F1, Accuracy, MCC per issue
│   ├── 09_filter_common_issues.py  # Identify the intersection of valid issues across all 3 models
│   ├── 10_aggregate_metrics.py  # Aggregate metrics across projects and models
│   ├── run_multi_project.py  # Orchestrator
│   └── count_divergent_cases.py  # Count structurally divergent cases (potentially functionally equivalent)
├── llm_selection/
│   ├── artificialanalysis.csv  # LLM ranking from artificialanalysis.ai
│   ├── llmstats.csv  # LLM ranking from llmstats.com
│   ├── opencompass.csv  # LLM ranking from OpenCompass leaderboard
│   ├── llm_selection.py  # Aggregates the three rankings and produces the final ranked list
│   └── final_ranking.csv  # Final ranked LLMs; top 3 open-weight models are used in the study
└── project_selection/
    ├── all_java_projects.csv  # Candidate pool of Java projects from SQuaD
    ├── list_projects.py  # Selects Java projects from SQuaD; writes all_java_projects.csv
    ├── projects_selection.py  # Ranks candidates by 14 metrics; writes the two CSVs below and study_projects.csv
    ├── projects_at_least_9_median.csv  # Projects above the median on ≥9 of 14 metrics
    ├── projects_at_least_9_q3.csv  # Projects above the third quartile on ≥9 of 14 metrics (used for the study)
    └── count_final_issues.py  # Computes Cochran's sample size from the number of filtered issues
</code></pre>
<h2>3. Requirements</h2>
<h3>System</h3>
<ul>
<li><strong>OS</strong>: Linux (Ubuntu 22.04 or later recommended)</li>
<li><strong>Python</strong>: 3.10 or later</li>
<li><strong>Git</strong>: 2.43 or later</li>
</ul>
<h3>Python dependencies</h3>
<pre><code class="language-bash">python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
</code></pre>
<h3>Agent harness</h3>
<p>The Ollama Cloud API key must be configured in <strong>two places</strong>:</p>
<p><strong>1. As an environment variable</strong> (used by the pipeline scripts):</p>
<pre><code class="language-bash">export OLLAMA_API_KEY="<your_key>"
</code></pre>
<p><strong>2.
In the OpenCode configuration file</strong> (used by the agent harness):</p>
<pre><code class="language-bash"># Create (or overwrite) OpenCode's config file
mkdir -p ~/.config/opencode
cat > ~/.config/opencode/opencode.json << 'EOF'
{
  "provider": {
    "ollama-cloud": {
      "api_key": "<your_key>"
    }
  }
}
EOF
</code></pre>
<p>Replace <code><your_key></code> with your Ollama Cloud API key in both<br>
locations.</p>
<p>The three open-weight LLMs evaluated in the paper are pulled by<br>
Ollama at runtime:</p>
<ul>
<li><code>ollama-cloud/glm-5:cloud</code></li>
<li><code>ollama-cloud/kimi-k2.5</code></li>
<li><code>ollama-cloud/qwen3.5:397b</code></li>
</ul>
<h2>4. LLM Selection</h2>
<p>The three LLMs evaluated in the study (GLM-5, Kimi-K2.5,<br>
Qwen3.5:397B) were selected through a rank-aggregation<br>
procedure implemented in <code>llm_selection/</code>:</p>
<p><strong>Inputs.</strong> Three public leaderboards ranking open-weight<br>
LLMs on coding-related tasks, collected in March 2026.<br>
Each leaderboard provides an ordered list of models from<br>
best to worst:</p>
<ul>
<li><code>artificialanalysis.csv</code> (artificialanalysis.ai)</li>
<li><code>llmstats.csv</code> (llmstats.com)</li>
<li><code>opencompass.csv</code> (OpenCompass leaderboard)</li>
</ul>
<p><strong>Aggregation.</strong> <code>llm_selection.py</code> combines the three<br>
rankings and produces <code>final_ranking.csv</code>.</p>
<p>The top-3 open-weight LLMs from <code>final_ranking.csv</code> are<br>
used to instantiate the three agents evaluated in the study.</p>
<p>To reproduce the selection:</p>
<pre><code class="language-bash">python llm_selection/llm_selection.py
</code></pre>
<h2>5.
Project Selection</h2>
<p>The 10 Apache Java projects evaluated in the study were<br>
selected through a three-step procedure implemented in<br>
<code>project_selection/</code>:</p>
<p><strong>Step 1: Candidate pool.</strong> <code>list_projects.py</code> queries the<br>
SQuaD database and retains every Java project on Apache JIRA<br>
with a valid commit/issue linkage and a non-trivial number of<br>
closed bug-fix commits. The output is <code>all_java_projects.csv</code>.</p>
<p><strong>Step 2: Ranking.</strong> <code>projects_selection.py</code> evaluates each<br>
candidate against 14 size- and maturity-related metrics drawn<br>
from SQuaD. Two ranked subsets are produced:</p>
<ul>
<li><code>projects_at_least_9_median.csv</code>: projects above the median<br>
on at least 9 of 14 metrics (lenient ranking)</li>
<li><code>projects_at_least_9_q3.csv</code>: projects above the third<br>
quartile on at least 9 of 14 metrics (strict ranking)</li>
</ul>
<p>The strict ranking is used to select the 10 top-ranked<br>
projects analyzed in the study, written to<br>
<code>data/interim/study_projects.csv</code>.</p>
<p><strong>Step 3: Sample size.</strong> <code>count_final_issues.py</code> reads the<br>
filtered issue counts per selected project and computes the<br>
target sample size using Cochran's formula (95% confidence,<br>
5% margin of error). The output guides the sampling stage of<br>
the main pipeline (<code>03_select_pilot_subset.py</code>).</p>
<p>To reproduce the selection:</p>
<pre><code class="language-bash">python project_selection/list_projects.py --language Java --output all_java_projects.csv
python project_selection/projects_selection.py
python project_selection/count_final_issues.py --input data/interim/study_projects.csv
</code></pre>
<h2>6.
Execution Order</h2>
<p>The pipeline is organized as a sequence of numbered stages.<br>
The early stages (<code>00</code>–<code>04</code>) prepare the data, the central<br>
stage (<code>run_multi_project.py</code>) executes the agents on each<br>
project, and the final stages (<code>09</code>–<code>10</code>) aggregate the<br>
results.</p>
<h3>6.1 Full pipeline (recommended)</h3>
<h4>Step 1: Data preparation</h4>
<pre><code class="language-bash">bash scripts/00_download_csv_SQuaD.sh
python scripts/01_filter_and_link.py
python scripts/02_clone_repos.py
python scripts/run_multi_project.py \
    --projects-file data/interim/study_projects.csv \
    --skip-04b \
    --only-stages "03_select_pilot" "04_ground_truth" \
    --continue-on-error
</code></pre>
<h4>Step 2: Agent runs</h4>
<p>The orchestrator <code>run_multi_project.py</code> invokes one LLM at a<br>
time over all selected projects. Run it once per model:</p>
<pre><code class="language-bash">for model in "ollama-cloud/glm-5:cloud" \
             "ollama-cloud/kimi-k2.5" \
             "ollama-cloud/qwen3.5:397b"; do
    python scripts/run_multi_project.py \
        --projects-file data/interim/study_projects.csv \
        --model "$model" \
        --skip-prep \
        --skip-04b \
        --timeout 600 \
        --continue-on-error
done
</code></pre>
<p>For each (model, project) pair, the orchestrator runs in order:</p>
<ol>
<li><code>06_run_pilot.py</code>: invokes the agent on every issue in<br>
<code>pilot_final__<project>.parquet</code> and saves<br>
<code>patch.diff</code> + <code>run_metadata.json</code> into<br>
<code>data/processed/agent_outputs/<model>/<project>__<issue>/</code></li>
<li><code>07_extract_aifc_entities.py</code>: parses each agent patch and<br>
extracts the modified entities at package, class, and<br>
method granularity</li>
<li><code>08_compute_metrics.py</code>: computes TP/FP/FN/TN, Precision,<br>
Recall, F1, Accuracy, and MCC per (issue, granularity),<br>
and writes<br>
<code>data/processed/metrics/<model>/per_issue__<project>.csv</code></li>
</ol>
<p>Each agent run is capped at a
600-second timeout. If a run exceeds the timeout,<br>
the same script must be rerun with the timeout parameter increased to 1200 seconds (<code>--timeout 1200</code>).<br>
Validity filtering (exit code 0 and not timed out) is applied at metric-computation time.</p>
<h4>Step 3: Cross-model aggregation</h4>
<p>After all three models have completed, restrict the analysis<br>
to the intersection of valid issues, then aggregate:</p>
<pre><code class="language-bash">python scripts/09_filter_common_issues.py \
    --models ollama-cloud/glm-5:cloud \
             ollama-cloud/kimi-k2.5 \
             ollama-cloud/qwen3.5:397b
python scripts/10_aggregate_metrics.py \
    --models ollama-cloud/glm-5:cloud \
             ollama-cloud/kimi-k2.5 \
             ollama-cloud/qwen3.5:397b
</code></pre>
<p>The aggregated CSVs in <code>_global/</code> correspond to the numbers<br>
reported in Table 1 of the paper.</p>
<h3>6.2 Running a single issue (debugging)</h3>
<p>To experiment with a single (project, issue, model)<br>
combination, which is useful for testing prompts or harness behavior:</p>
<pre><code class="language-bash">python scripts/05_run_one_issue.py \
    --project "apache#hudi" \
    --issue HUDI-990 \
    --model ollama-cloud/kimi-k2.5
</code></pre>
<p>This script does not write to the final metrics; its output<br>
lives in <code>data/processed/agent_outputs/</code> and can be inspected<br>
manually.</p>
<h3>6.3 Running a single project</h3>
<p>To run a single project across all stages (without iterating<br>
over all 10), pass <code>--projects</code> instead of <code>--projects-file</code>:</p>
<pre><code class="language-bash">python scripts/run_multi_project.py \
    --projects "apache#hbase" \
    --model ollama-cloud/glm-5:cloud \
    --skip-prep \
    --skip-04b \
    --timeout 600 \
    --continue-on-error
</code></pre>
<h2>7. Divergent Cases Analysis</h2>
<p>The script <code>count_divergent_cases.py</code> quantifies the number of <strong>structurally divergent<br>
cases</strong> in the localization results, defined as issues for which the agent simultaneously<br>
introduced modifications
outside the human ground truth (<code>fp > 0</code>) and missed entities<br>
that the developer actually touched (<code>fn > 0</code>). Structural divergence is identified<br>
deterministically from the confusion-matrix counts and does not require manual inspection.</p>
<p>These cases are relevant for the construct-validity discussion of the study: a subset<br>
of them may correspond to <strong>functionally equivalent implementations</strong> that overlap-based<br>
metrics (Accuracy, Precision, Recall, F1, MCC) cannot capture, since the agent may resolve<br>
the issue through differently named or structurally distinct entities. Counting them<br>
therefore provides a conservative upper bound on how many evaluations could potentially<br>
be re-classified once functional equivalence is assessed through manual inspection or<br>
semantic analysis, a step left to future work.</p>
<p><strong>Input.</strong> The script expects a single CSV file with one row per (model, project, issue,<br>
granularity) combination, containing the columns: <code>project, model, issue_key, granularity, n_human, n_agent, tp, fp, fn, tn, accuracy, precision, recall, f1, mcc</code>.</p>
<p>In our pipeline, the filtered per-issue metrics are produced by <code>09_filter_common_issues.py</code> and stored under<br>
<code>data/processed/metrics/<model>/per_issue_filtered__apache__<project>.csv</code>, i.e. one file<br>
per (model, project) pair. Before running the divergent-cases analysis, these files must<br>
be concatenated into a single consolidated CSV.
This can be done with a one-liner:</p>
<pre><code class="language-bash"># Keep the header row of the first file only, then append the data rows of every file
awk 'FNR==1 && NR!=1 { next } { print }' \
    data/processed/metrics/ollama-cloud__*/per_issue_filtered__apache__*.csv \
    > data/processed/metrics/_common/all_metrics.csv
</code></pre>
<p>The resulting <code>all_metrics.csv</code> is the input to <code>count_divergent_cases.py</code>.</p>
<p><strong>Output.</strong> A summary printed to standard output reporting (i) the total number and<br>
percentage of structurally divergent rows, and breakdowns by (ii) <code>model</code>,<br>
(iii) <code>granularity</code>, (iv) <code>project</code>, and (v) the cross-tabulation <code>model × granularity</code>.<br>
Optionally, the script exports the subset of divergent rows to a separate CSV for<br>
further manual inspection.</p>
<pre><code class="language-bash"># Print the summary only
python scripts/count_divergent_cases.py \
    data/processed/metrics/_common/all_metrics.csv

# Print the summary and also export the divergent rows
python scripts/count_divergent_cases.py \
    data/processed/metrics/_common/all_metrics.csv \
    --output data/processed/metrics/_common/divergent_rows.csv
</code></pre>
<h2>8. License</h2>
<p>This replication package contains material distributed under<br>
multiple licenses, depending on its origin:</p>
<table>
<tbody>
<tr> <th>Component</th> <th>License</th> <th>File</th> </tr>
</tbody>
<tbody>
<tr> <td>Code (Python scripts, shell scripts)</td> <td>MIT License</td> <td><code>SCRIPT_LICENSE</code></td> </tr>
<tr> <td>Derived data and aggregated results</td> <td>CC-BY 4.0</td> <td><code>DATA_LICENSE</code></td> </tr>
<tr> <td>Third-party material (Apache projects, SQuaD, OpenCode, LLMs)</td> <td>Retained under original licenses</td> <td><code>NOTICE.md</code></td> </tr>
</tbody>
</table>
<p>See <code>NOTICE.md</code> for full attribution and third-party license<br>
information.</p>
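Supplementary sketch. The per-issue metrics of Section 6 and the structural-divergence criterion of Section 7 (`fp > 0` and `fn > 0`) can be illustrated from the confusion-matrix counts alone. The code below is a minimal sketch, not the package's `08_compute_metrics.py` or `count_divergent_cases.py`: the function names are hypothetical, the formulas are the standard textbook definitions, and returning 0.0 when a denominator is zero is an assumed convention that the package may handle differently.

```python
import math


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard Precision/Recall/F1/Accuracy/MCC from confusion counts.

    Assumption: any ratio with a zero denominator is reported as 0.0.
    """
    def safe_div(num: float, den: float) -> float:
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
    }


def is_structurally_divergent(fp: int, fn: int) -> bool:
    """True when the agent both touched entities outside the ground truth
    (fp > 0) and missed entities the developer modified (fn > 0)."""
    return fp > 0 and fn > 0


if __name__ == "__main__":
    # Hypothetical counts for one (issue, granularity) row.
    row = {"tp": 2, "fp": 1, "fn": 1, "tn": 6}
    print(metrics_from_counts(**row))
    print(is_structurally_divergent(row["fp"], row["fn"]))
```

Working from counts rather than entity sets keeps the sketch independent of how the pipeline defines the universe of candidate entities needed for TN, which the text above does not specify.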