| Main Authors: | Ouyang, Shuyin, Huang, Dong, Guo, Jingwen, Sun, Zeyu, Zhu, Qihao, Zhang, Jie M. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.15621 |
Similar Items
Knowledge-Enhanced Program Repair for Data Science Code
by: Ouyang, Shuyin, et al.
Published: (2025)
An Empirical Study of the Non-determinism of ChatGPT in Code Generation
by: Ouyang, Shuyin, et al.
Published: (2023)
Lyra: A Benchmark for Turducken-Style Code Generation
by: Liang, Qingyuan, et al.
Published: (2021)
EffiBench: Benchmarking the Efficiency of Automatically Generated Code
by: Huang, Dong, et al.
Published: (2024)
Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation
by: Ouyang, Shuyin, et al.
Published: (2026)
DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models
by: Kumarappan, Adarsh, et al.
Published: (2026)
CupCleaner: A Hybrid Data Cleaning Approach for Comment Updating
by: Liang, Qingyuan, et al.
Published: (2023)
SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers
by: Xiang, Yanzheng, et al.
Published: (2025)
Directional Diffusion-Style Code Editing Pre-training
by: Liang, Qingyuan, et al.
Published: (2025)
EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations
by: Li, Jia, et al.
Published: (2024)
TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation
by: Gong, Zhihao, et al.
Published: (2026)
TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation
by: Gong, Zhihao, et al.
Published: (2025)
QuanBench: Benchmarking Quantum Code Generation with Large Language Models
by: Guo, Xiaoyu, et al.
Published: (2025)
AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
by: Chou, Jason, et al.
Published: (2025)
Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding
by: Ouyang, Shuyin, et al.
Published: (2026)
Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation
by: Garg, Spandan, et al.
Published: (2025)
Learning to Guarantee Type Correctness in Code Generation through Type-Guided Program Synthesis
by: Huang, Zhechong, et al.
Published: (2025)
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
by: Li, Wei, et al.
Published: (2025)
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
by: Chen, Zaoyu, et al.
Published: (2026)
Contextualized Code Pretraining for Code Generation
by: Liu, Chen, et al.
Published: (2026)
HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation
by: Zheng, Dewu, et al.
Published: (2024)
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
by: Li, Jia, et al.
Published: (2026)
EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories
by: Li, Jia, et al.
Published: (2024)
OSS-Bench: Benchmark Generator for Coding LLMs
by: Jiang, Yuancheng, et al.
Published: (2025)
CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
by: Guo, Hanyang, et al.
Published: (2025)
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
by: Wang, Sizhe, et al.
Published: (2025)
GramTrans: A Better Code Representation Approach in Code Generation
by: Zhang, Zhao, et al.
Published: (2025)
SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation
by: He, Yibo, et al.
Published: (2025)
SWE-Bench+: Enhanced Coding Benchmark for LLMs
by: Aleithan, Reem, et al.
Published: (2024)
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
by: Merrill, Mike A., et al.
Published: (2026)
SWE Context Bench: A Benchmark for Context Learning in Coding
by: Zhu, Jiayuan, et al.
Published: (2026)
RubberDuckBench: A Benchmark for AI Coding Assistants
by: Mohammed, Ferida, et al.
Published: (2026)
FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation
by: Chen, Haorui, et al.
Published: (2025)
LAURA: Enhancing Code Review Generation with Context-Enriched Retrieval-Augmented LLM
by: Zhang, Yuxin, et al.
Published: (2025)
CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
by: Xie, Yiqing, et al.
Published: (2024)
AICD Bench: A Challenging Benchmark for AI-Generated Code Detection
by: Orel, Daniil, et al.
Published: (2026)
Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction
by: Zhang, Zhe, et al.
Published: (2025)
A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends
by: Zheng, Zibin, et al.
Published: (2023)
Prompt Alchemy: Automatic Prompt Refinement for Enhancing Code Generation
by: Ye, Sixiang, et al.
Published: (2025)
LLMs are Bug Replicators: An Empirical Study on LLMs' Capability in Completing Bug-prone Code
by: Guo, Liwei, et al.
Published: (2025)