| Main Authors: | Guan, Batu, Wu, Xiao, Yuan, Yuanyuan, Li, Shaohua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.06643 |
Similar Items
Mercury: A Code Efficiency Benchmark for Code Large Language Models
by: Du, Mingzhe, et al.
Published: (2024)
AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
by: Chou, Jason, et al.
Published: (2025)
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
by: Zhu, Wang Bill, et al.
Published: (2026)
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
by: Chen, Zaoyu, et al.
Published: (2026)
Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback
by: Bi, Zhangqian, et al.
Published: (2024)
A Code Comprehension Benchmark for Large Language Models for Code
by: Havare, Jayant, et al.
Published: (2025)
Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval
by: Wu, Jiarong, et al.
Published: (2025)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study
by: Dou, Shihan, et al.
Published: (2024)
Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
by: Chen, Simin, et al.
Published: (2025)
Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code
by: Jiang, Nan, et al.
Published: (2024)
R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models
by: Deng, Ken, et al.
Published: (2024)
EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations
by: Li, Jia, et al.
Published: (2024)
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
by: Fu, Lingyue, et al.
Published: (2025)
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use
by: Huang, Yue, et al.
Published: (2023)
Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
by: Zhao, Songwen, et al.
Published: (2025)
Benchmarking Failures in Tool-Augmented Language Models
by: Treviño, Eduardo, et al.
Published: (2025)
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
by: Wang, Sizhe, et al.
Published: (2025)
Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks
by: Yang, Kang, et al.
Published: (2025)
Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval
by: Wang, Jiexin, et al.
Published: (2024)
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
by: Peng, Qiwei, et al.
Published: (2024)
DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
by: Li, Jia, et al.
Published: (2024)
EffiBench: Benchmarking the Efficiency of Automatically Generated Code
by: Huang, Dong, et al.
Published: (2024)
CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
by: Xie, Yiqing, et al.
Published: (2024)
Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models
by: Zheng, Jiasheng, et al.
Published: (2024)
PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
by: Jing, Huihao, et al.
Published: (2026)
Uncertainty Awareness of Large Language Models Under Code Distribution Shifts: A Benchmark Study
by: Li, Yufei, et al.
Published: (2024)
CODEMENV: Benchmarking Large Language Models on Code Migration
by: Cheng, Keyuan, et al.
Published: (2025)
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates
by: Liu, Zeyu Leo, et al.
Published: (2024)
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
by: Li, Wei, et al.
Published: (2025)
E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
by: Liu, Jingyao, et al.
Published: (2025)
EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories
by: Li, Jia, et al.
Published: (2024)
IndustryCode: A Benchmark for Industry Code Generation
by: Zeng, Puyu, et al.
Published: (2026)
Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages
by: Zhang, William, et al.
Published: (2024)
ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation
by: Liu, Kaiyuan, et al.
Published: (2025)
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
by: Zhang, Linghao, et al.
Published: (2025)
MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems
by: Li, Kaixin, et al.
Published: (2024)
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions
by: Ding, Xianzhong, et al.
Published: (2026)
Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking
by: Li, Zhuohao, et al.
Published: (2025)
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
by: Lam, Man Ho, et al.
Published: (2026)
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
by: Chen, Yeheng, et al.
Published: (2026)