| Main Authors: | Prathifkumar, Thanosan, Mathews, Noble Saji, Nagappan, Meiyappan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2512.10218 |
Similar Items
Test-Driven Development for Code Generation
by: Mathews, Noble Saji, et al.
Published: (2024)
Is Your Automated Software Engineer Trustworthy?
by: Mathews, Noble Saji, et al.
Published: (2025)
Design choices made by LLM-based test generators prevent them from finding bugs
by: Mathews, Noble Saji, et al.
Published: (2024)
AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests
by: Khatib, Lara, et al.
Published: (2025)
FuzzSlice: Pruning False Positives in Static Analysis Warnings Through Function-Level Fuzzing
by: Murali, Aniruddhan, et al.
Published: (2024)
LLbezpeky: Leveraging Large Language Models for Vulnerability Detection
by: Mathews, Noble Saji, et al.
Published: (2024)
Examining LLMs Ability to Summarize Code Through Mutation-Analysis
by: Khatib, Lara, et al.
Published: (2026)
Whodunit: Classifying Code as Human Authored or GPT-4 Generated -- A case study on CodeChef problems
by: Idialu, Oseremen Joy, et al.
Published: (2024)
Understanding the Human-LLM Dynamic: A Literature Survey of LLM Use in Programming Tasks
by: Etsenake, Deborah, et al.
Published: (2024)
Is GitHub's Copilot as Bad as Humans at Introducing Vulnerabilities in Code?
by: Asare, Owura, et al.
Published: (2022)
A User-centered Security Evaluation of Copilot
by: Asare, Owura, et al.
Published: (2023)
RLocator: Reinforcement Learning for Bug Localization
by: Chakraborty, Partha, et al.
Published: (2023)
BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning
by: Chakraborty, Partha, et al.
Published: (2024)
Measuring the Runtime Performance of C++ Code Written by Humans using GitHub Copilot
by: Erhabor, Daniel, et al.
Published: (2023)
Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization
by: Chakraborty, Partha, et al.
Published: (2024)
AddressWatcher: Sanitizer-Based Localization of Memory Leak Fixes
by: Murali, Aniruddhan, et al.
Published: (2024)
Training Software Engineering Agents and Verifiers with SWE-Gym
by: Pan, Jiayi, et al.
Published: (2024)
SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling
by: Han, Hao, et al.
Published: (2026)
SWE-Bench+: Enhanced Coding Benchmark for LLMs
by: Aleithan, Reem, et al.
Published: (2024)
CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs
by: Mathai, Alex, et al.
Published: (2024)
SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?
by: Tian, Muxin, et al.
Published: (2026)
SWE-Bench-CL: Continual Learning for Coding Agents
by: Joshi, Thomas, et al.
Published: (2025)
Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation
by: Garg, Spandan, et al.
Published: (2025)
Reproduction Test Generation for Java SWE Issues
by: Ahmed, Toufique, et al.
Published: (2026)
UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
by: Yu, Boxi, et al.
Published: (2025)
What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair
by: Martinez, Matias, et al.
Published: (2026)
Towards Understanding Barriers and Mitigation Strategies of Software Engineers with Non-traditional Educational and Occupational Backgrounds
by: Barnes, Tavian, et al.
Published: (2022)
SWE-Universe: Scale Real-World Verifiable Environments to Millions
by: Chen, Mouxiang, et al.
Published: (2026)
SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation
by: Oliva, Gustavo A., et al.
Published: (2025)
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
by: Xu, Jingxuan, et al.
Published: (2025)
SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks
by: Mhatre, Sanket, et al.
Published: (2025)
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
by: Deng, Xiang, et al.
Published: (2025)
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies
by: Saxena, Siddhant, et al.
Published: (2026)
SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?
by: Han, Tingxu, et al.
Published: (2026)
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
by: Huang, Jiawei, et al.
Published: (2026)
Investigating Test Overfitting on SWE-bench
by: Ahmed, Toufique, et al.
Published: (2025)
SWE Context Bench: A Benchmark for Context Learning in Coding
by: Zhu, Jiayuan, et al.
Published: (2026)
SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents
by: Dihan, Mahir Labib, et al.
Published: (2026)
SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents
by: Rashid, Muhammad Shihab, et al.
Published: (2025)
GHIssuemarket: A Sandbox Environment for SWE-Agents Economic Experimentation
by: Fouad, Mohamed A., et al.
Published: (2024)