Bibliographic Details
Main Authors: Zan, Daoguang; Huang, Zhirong; Liu, Wei; Chen, Hanwu; Zhang, Linhao; Xin, Shulin; Chen, Lu; Liu, Qi; Zhong, Xiaojian; Li, Aoyan; Liu, Siyao; Xiao, Yongsheng; Chen, Liangqiang; Zhang, Yuyu; Su, Jing; Liu, Tianyu; Long, Rui; Shen, Kai; Xiang, Liang
Format: Preprint
Published: 2025
Subjects: Software Engineering, Artificial Intelligence, Computation and Language
Online Access: https://arxiv.org/abs/2504.02605
_version_ 1866913775434596352
author Zan, Daoguang
Huang, Zhirong
Liu, Wei
Chen, Hanwu
Zhang, Linhao
Xin, Shulin
Chen, Lu
Liu, Qi
Zhong, Xiaojian
Li, Aoyan
Liu, Siyao
Xiao, Yongsheng
Chen, Liangqiang
Zhang, Yuyu
Su, Jing
Liu, Tianyu
Long, Rui
Shen, Kai
Xiang, Liang
contents The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.
format Preprint
id arxiv_https___arxiv_org_abs_2504_02605
institution arXiv
publishDate 2025
record_format arxiv
title Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
topic Software Engineering
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2504.02605