Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Martinez, Matias, Franch, Xavier
Format:	Preprint
Published:	2026
Subjects:	Software Engineering
Online Access:	https://arxiv.org/abs/2602.04449
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866910011556364288
author	Martinez, Matias Franch, Xavier
author_facet	Martinez, Matias Franch, Xavier
contents	The rapid progress in Automated Program Repair (APR) has been fueled by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a benchmark designed to evaluate repair systems using real issues mined from popular open-source Python repositories. Its public leaderboards-SWE-Bench Lite and Verified-have become central platforms for tracking progress and comparing solutions. In this paper, we present the first comprehensive study of these two leaderboards, examining who is submitting solutions, the products behind the submissions, the LLMs employed, and the openness of the approaches. We analyze 79 entries submitted to Lite leaderboard and 133 to Verified. Our results show that most entries on both leaderboards originate from industry, particularly small companies and large publicly traded companies. These submissions often achieve top results, although academic contributions-typically open source-also remain competitive. We also find a clear dominance of proprietary LLMs, especially Claude family, with state-of-the-art results on both leaderboards currently achieved by Claude 4 Sonnet. These findings offer insights into the SWE-Bench ecosystem that can guide greater transparency and diversity in future benchmark-driven research.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_04449
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair Martinez, Matias Franch, Xavier Software Engineering The rapid progress in Automated Program Repair (APR) has been fueled by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a benchmark designed to evaluate repair systems using real issues mined from popular open-source Python repositories. Its public leaderboards-SWE-Bench Lite and Verified-have become central platforms for tracking progress and comparing solutions. In this paper, we present the first comprehensive study of these two leaderboards, examining who is submitting solutions, the products behind the submissions, the LLMs employed, and the openness of the approaches. We analyze 79 entries submitted to Lite leaderboard and 133 to Verified. Our results show that most entries on both leaderboards originate from industry, particularly small companies and large publicly traded companies. These submissions often achieve top results, although academic contributions-typically open source-also remain competitive. We also find a clear dominance of proprietary LLMs, especially Claude family, with state-of-the-art results on both leaderboards currently achieved by Claude 4 Sonnet. These findings offer insights into the SWE-Bench ecosystem that can guide greater transparency and diversity in future benchmark-driven research.
title	What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair
topic	Software Engineering
url	https://arxiv.org/abs/2602.04449

Similar Items