| Main Authors: | Szych, Joanna; Schwerk, Anne |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.09059 |
Similar Items
Benchmarking and Studying the LLM-based Code Review
by: Zeng, Zhengran, et al.
Published: (2025)
A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era
by: Khan, Taufiqul Islam, et al.
Published: (2026)
Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation
by: Cui, Yi
Published: (2025)
CodeArena: A Collective Evaluation Platform for LLM Code Generation
by: Du, Mingzhe, et al.
Published: (2025)
Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks
by: Farchi, Eitan, et al.
Published: (2024)
LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation
by: Fakhoury, Sarah, et al.
Published: (2024)
The Fault in our Stars: Quality Assessment of Code Generation Benchmarks
by: Siddiq, Mohammed Latif, et al.
Published: (2024)
Comparing Developer and LLM Biases in Code Evaluation
by: Mittal, Aditya, et al.
Published: (2026)
DSL or Code? Evaluating the Quality of LLM-Generated Algebraic Specifications: A Case Study in Optimization at Kinaxis
by: Ayoughi, Negin, et al.
Published: (2026)
COFFE: A Code Efficiency Benchmark for Code Generation
by: Peng, Yun, et al.
Published: (2025)
UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian
by: Syromiatnikov, Mykyta, et al.
Published: (2025)
Evaluating Efficiency and Novelty of LLM-Generated Code for Graph Analysis
by: Nia, Atieh Barati, et al.
Published: (2025)
Re-Evaluating Code LLM Benchmarks Under Semantic Mutation
by: Pan, Zhiyuan, et al.
Published: (2025)
Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents
by: Saxena, Divyanshu, et al.
Published: (2025)
Human or LLM? A Comparative Study on Accessible Code Generation Capability
by: Suh, Hyunjae, et al.
Published: (2025)
Development and Benchmarking of Multilingual Code Clone Detector
by: Zhu, Wenqing, et al.
Published: (2024)
SolContractEval: A Benchmark for Evaluating Contract-Level Solidity Code Generation
by: Ye, Zhifan, et al.
Published: (2025)
LLMs in Web Development: Evaluating LLM-Generated PHP Code Unveiling Vulnerabilities and Limitations
by: Tóth, Rebeka, et al.
Published: (2024)
HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent
by: Wu, Jie JW, et al.
Published: (2024)
Beyond Code Similarity: Benchmarking the Plausibility, Efficiency, and Complexity of LLM-Generated Smart Contracts
by: Salzano, Francesco, et al.
Published: (2025)
Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks
by: Hasan, Md Mahade, et al.
Published: (2025)
ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code
by: Feng, Jia, et al.
Published: (2024)
Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development
by: Zeng, Zhengran, et al.
Published: (2025)
Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review
by: Paul, Debalina Ghosh, et al.
Published: (2024)
A Differential Fuzzing-Based Evaluation of Functional Equivalence in LLM-Generated Code Refactorings
by: Dristi, Simantika Bhattacharjee, et al.
Published: (2026)
Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation
by: Zhang, Binquan, et al.
Published: (2025)
SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code
by: Wang, Qinglin, et al.
Published: (2025)
Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models
by: Das, Gunjan, et al.
Published: (2025)
A Performance Study of LLM-Generated Code on Leetcode
by: Coignion, Tristan, et al.
Published: (2024)
EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations
by: Li, Jia, et al.
Published: (2024)
Inducing Vulnerable Code Generation in LLM Coding Assistants
by: Zeng, Binqi, et al.
Published: (2025)
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
by: Li, Jia, et al.
Published: (2026)
ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation
by: Paul, Debalina Ghosh, et al.
Published: (2024)
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
by: Fu, Lingyue, et al.
Published: (2025)
HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation
by: Zheng, Dewu, et al.
Published: (2024)
On the Effectiveness of Training Data Optimization for LLM-based Code Generation: An Empirical Study
by: Kuang, Shiqi, et al.
Published: (2025)
Copilot Arena: A Platform for Code LLM Evaluation in the Wild
by: Chi, Wayne, et al.
Published: (2025)
CodeScore: Evaluating Code Generation by Learning Code Execution
by: Dong, Yihong, et al.
Published: (2023)
Beyond Code Generation: Assessing Code LLM Maturity with Postconditions
by: He, Fusen, et al.
Published: (2024)
LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages
by: Diehl, Patrick, et al.
Published: (2025)