| Main Authors: | Liu, Jingyao, Huang, Chen, Guan, Zhizhao, Lei, Wenqiang, Deng, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.14509 |
Similar Items
ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development
by: Lu, Pengrui, et al.
Published: (2026)
SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
by: Adamenko, Pavel, et al.
Published: (2025)
EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents
by: Liu, Junwei, et al.
Published: (2025)
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
by: Tran, Hung, et al.
Published: (2026)
Automated Web Application Testing: End-to-End Test Case Generation with Large Language Models and Screen Transition Graphs
by: Le, Nguyen-Khang, et al.
Published: (2025)
Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents
by: Wang, Kaixin, et al.
Published: (2025)
WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
by: Kong, Fanheng, et al.
Published: (2026)
A Layered Architecture for Developing and Enhancing Capabilities in Large Language Model-based Software Systems
by: Zhang, Dawen, et al.
Published: (2024)
Multilingual Multimodal Software Developer for Code Generation
by: Chai, Linzheng, et al.
Published: (2025)
Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
by: Hu, Ruida, et al.
Published: (2026)
How Well Do Large Language Models Serve as End-to-End Secure Code Agents for Python?
by: Gong, Jianian, et al.
Published: (2024)
ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models
by: Han, Hojae, et al.
Published: (2024)
Solving Data-centric Tasks using Large Language Models
by: Barke, Shraddha, et al.
Published: (2024)
Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
by: Chen, Simin, et al.
Published: (2025)
KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development?
by: Jiang, Xue, et al.
Published: (2026)
CodeMind: Evaluating Large Language Models for Code Reasoning
by: Liu, Changshu, et al.
Published: (2024)
FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation
by: Zhu, Hongda, et al.
Published: (2025)
A Code Comprehension Benchmark for Large Language Models for Code
by: Havare, Jayant, et al.
Published: (2025)
Efficient Detection of Toxic Prompts in Large Language Models
by: Liu, Yi, et al.
Published: (2024)
TRUSTVIS: A Multi-Dimensional Trustworthiness Evaluation Framework for Large Language Models
by: Sun, Ruoyu, et al.
Published: (2025)
Large Language Models for In-File Vulnerability Localization Can Be "Lost in the End"
by: Sovrano, Francesco, et al.
Published: (2025)
Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models
by: Huang, Yuheng, et al.
Published: (2023)
SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents
by: Zhang, Zhirui, et al.
Published: (2026)
Bench4HLS: End-to-End Evaluation of LLMs in High-Level Synthesis Code Generation
by: Khan, M Zafir Sadik, et al.
Published: (2026)
Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code
by: Zhang, Ziyin, et al.
Published: (2023)
Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling
by: Ni, Ziyi, et al.
Published: (2024)
Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification?
by: Richter, Cedric, et al.
Published: (2025)
OmniCode: A Benchmark for Evaluating Software Engineering Agents
by: Sonwane, Atharv, et al.
Published: (2026)
A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks
by: Dandamudi, Rohit, et al.
Published: (2024)
CODEMENV: Benchmarking Large Language Models on Code Migration
by: Cheng, Keyuan, et al.
Published: (2025)
Iterative Experience Refinement of Software-Developing Agents
by: Qian, Chen, et al.
Published: (2024)
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
by: Zheng, Jiasheng, et al.
Published: (2024)
Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development
by: Qiu, Rennai, et al.
Published: (2025)
Experiential Co-Learning of Software-Developing Agents
by: Qian, Chen, et al.
Published: (2023)
Transducer Tuning: Efficient Model Adaptation for Software Tasks Using Code Property Graphs
by: Yusuf, Imam Nur Bani, et al.
Published: (2024)
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
by: Yang, Jie, et al.
Published: (2026)
BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software
by: Zhang, Zehua, et al.
Published: (2025)
Self-Explained Keywords Empower Large Language Models for Code Generation
by: Fan, Lishui, et al.
Published: (2024)
Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency
by: Huang, Baizhou, et al.
Published: (2023)
Process-Centric Analysis of Agentic Software Systems
by: Liu, Shuyang, et al.
Published: (2025)