| Main Authors: | Deng, Le, Jiang, Zhonghao, Cao, Jialun, Pradel, Michael, Liu, Zhongxin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.18130 |
Similar Items
Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study
by: Wang, You, et al.
Published: (2025)
Names Are All You Need: Effective and Safe Regression Test Selection for Python
by: Wang, You, et al.
Published: (2026)
iCoRe: An Iterative Correlation-Aware Retriever for Bug Reproduction Test Generation
by: Wang, Junyi, et al.
Published: (2026)
Agentic Software Issue Resolution with Large Language Models: A Survey
by: Jiang, Zhonghao, et al.
Published: (2025)
Testora: Using Natural Language Intent to Detect Behavioral Regressions
by: Pradel, Michael
Published: (2025)
CodeMapper: A Language-Agnostic Approach to Mapping Code Regions Across Commits
by: Hu, Huimin, et al.
Published: (2025)
FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks
by: Dai, Dekun, et al.
Published: (2025)
Issue Localization via LLM-Driven Iterative Code Graph Searching
by: Jiang, Zhonghao, et al.
Published: (2025)
PatchGuru: Patch Oracle Inference from Natural Language Artifacts with Large Language Models
by: Le-Cong, Thanh, et al.
Published: (2026)
A Benchmark for Evaluating Repository-Level Code Agents with Intermediate Reasoning on Feature Addition Task
by: Liu, Shuhan, et al.
Published: (2026)
De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding
by: Eghbali, Aryaz, et al.
Published: (2024)
Artisan: Agentic Artifact Evaluation
by: Baek, Doehyun, et al.
Published: (2026)
SolContractEval: A Benchmark for Evaluating Contract-Level Solidity Code Generation
by: Ye, Zhifan, et al.
Published: (2025)
RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories
by: Wang, Yanlin, et al.
Published: (2026)
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
by: Li, Wei, et al.
Published: (2025)
DyPyBench: A Benchmark of Executable Python Software
by: Bouzenia, Islem, et al.
Published: (2024)
ChangeGuard: Validating Code Changes via Pairwise Learning-Guided Execution
by: Gröninger, Lars, et al.
Published: (2024)
Improving Retrieval-Augmented Code Comment Generation by Retrieving for Generation
by: Lu, Hanzhen, et al.
Published: (2024)
Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval
by: Wu, Jiarong, et al.
Published: (2025)
JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models
by: Cao, Jialun, et al.
Published: (2024)
Concerned with Data Contamination? Assessing Countermeasures in Code Language Model
by: Cao, Jialun, et al.
Published: (2024)
DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation
by: Zhu, Qiming, et al.
Published: (2024)
Evaluating LLM Agents on Automated Software Analysis Tasks
by: Bouzenia, Islem, et al.
Published: (2026)
Enhancing Project-Specific Code Completion by Inferring Internal API Information
by: Deng, Le, et al.
Published: (2025)
CodeCureAgent: Automatic Classification and Repair of Static Analysis Warnings
by: Joos, Pascal, et al.
Published: (2025)
Instructive Code Retriever: Learn from Large Language Model's Feedback for Code Intelligence Tasks
by: Lu, Jiawei, et al.
Published: (2024)
Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation
by: Zhu, Qiming, et al.
Published: (2025)
What Builds Effective In-Context Examples for Code Generation?
by: Li, Dongze, et al.
Published: (2025)
From Bugs to Benchmarks: A Comprehensive Survey of Software Defect Datasets
by: Zhu, Hao-Nan, et al.
Published: (2025)
UniCoR: Modality Collaboration for Robust Cross-Language Hybrid Code Retrieval
by: Yang, Yang, et al.
Published: (2025)
CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning
by: Tang, Lingxiao, et al.
Published: (2025)
Self-Explained Keywords Empower Large Language Models for Code Generation
by: Fan, Lishui, et al.
Published: (2024)
Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories
by: Bouzenia, Islem, et al.
Published: (2025)
RippleGUItester: Change-Aware Exploratory Testing
by: Su, Yanqi, et al.
Published: (2026)
Treefix: Enabling Execution with a Tree of Prefixes
by: Souza, Beatriz, et al.
Published: (2025)
You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects
by: Bouzenia, Islem, et al.
Published: (2024)
AgentStepper: Interactive Debugging of Software Development Agents
by: Hutter, Robert, et al.
Published: (2026)
Unit Test Update through LLM-Driven Context Collection and Error-Type-Aware Refinement
by: Zhang, Yuanhe, et al.
Published: (2025)
Do Not Treat Code as Natural Language: Implications for Repository-Level Code Generation and Beyond
by: Le-Anh, Minh, et al.
Published: (2026)
TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks
by: Fujii, Ryo, et al.
Published: (2026)