Bibliographic Details
Main Authors: Oliva, Gustavo A., Rajbahadur, Gopi Krishnan, Bhatia, Aaditya, Zhang, Haoxiang, Chen, Yihao, Chen, Zhilong, Leung, Arthur, Lin, Dayi, Chen, Boyuan, Hassan, Ahmed E.
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2507.09108
contents High-quality labeled datasets are crucial for training and evaluating foundation models (FMs) in software engineering (SE), but creating them is often prohibitively expensive and labor-intensive. We introduce SPICE, a scalable, automated pipeline for labeling SWE-bench-style datasets with annotations for issue clarity, test coverage, and effort estimation. SPICE combines context-aware code navigation, rationale-driven prompting, and multi-pass consensus to produce labels that closely approximate expert annotations. SPICE's design was informed by our own experience and frustration in labeling more than 800 instances from SWE-Gym. SPICE achieves strong agreement with human-labeled SWE-bench Verified data while reducing the cost of labeling 1,000 instances from around $100,000 (manual annotation) to just $5.10. These results demonstrate SPICE's potential to enable cost-effective, large-scale dataset creation for SE-focused FMs. To support the community, we release both the SPICE tool and SPICE Bench, a new dataset of 6,802 SPICE-labeled instances curated from 291 open-source projects in SWE-Gym (over 13x larger than SWE-bench Verified).
id arxiv_https___arxiv_org_abs_2507_09108
institution arXiv
record_format arxiv
title SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation
topic Software Engineering
Artificial Intelligence