| Main Authors: | Ding, Xianzhong, Yu, Yangyang, Liu, Changwei, Zhao, Bill |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.24279 |
Similar Items
LongCodeZip: Compress Long Context for Code Language Models
by: Shi, Yuling, et al.
Published: (2025)
RepoQA: Evaluating Long Context Code Understanding
by: Liu, Jiawei, et al.
Published: (2024)
YABLoCo: Yet Another Benchmark for Long Context Code Generation
by: Valeev, Aidar, et al.
Published: (2025)
DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
by: Li, Jia, et al.
Published: (2024)
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
by: Yang, Jie, et al.
Published: (2026)
Mercury: A Code Efficiency Benchmark for Code Large Language Models
by: Du, Mingzhe, et al.
Published: (2024)
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
by: Lin, Jiahang, et al.
Published: (2026)
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
by: Bae, Suyoung, et al.
Published: (2026)
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
by: Wang, Sizhe, et al.
Published: (2025)
A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System
by: Guo, Jiale, et al.
Published: (2025)
Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
by: Zhao, Songwen, et al.
Published: (2025)
EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations
by: Li, Jia, et al.
Published: (2024)
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
by: Zhu, Wang Bill, et al.
Published: (2026)
Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI
by: Sapkota, Ranjan, et al.
Published: (2025)
CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
by: Xie, Yiqing, et al.
Published: (2024)
AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
by: Chou, Jason, et al.
Published: (2025)
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
by: Fu, Lingyue, et al.
Published: (2025)
IndustryCode: A Benchmark for Industry Code Generation
by: Zeng, Puyu, et al.
Published: (2026)
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates
by: Liu, Zeyu Leo, et al.
Published: (2024)
Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models
by: Guan, Batu, et al.
Published: (2025)
R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models
by: Deng, Ken, et al.
Published: (2024)
Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning
by: Štorek, Adam, et al.
Published: (2025)
Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback
by: Bi, Zhangqian, et al.
Published: (2024)
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
by: Zheng, Tianyu, et al.
Published: (2024)
Asymmetric Goal Drift in Coding Agents Under Value Conflict
by: Saebo, Magnus, et al.
Published: (2026)
Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking
by: Li, Zhuohao, et al.
Published: (2025)
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
by: Orlanski, Gabriel, et al.
Published: (2026)
SEW: Self-Evolving Agentic Workflows for Automated Code Generation
by: Liu, Siwei, et al.
Published: (2025)
ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation
by: Liu, Kaiyuan, et al.
Published: (2025)
SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents
by: Wang, Yuhang, et al.
Published: (2026)
EffiBench: Benchmarking the Efficiency of Automatically Generated Code
by: Huang, Dong, et al.
Published: (2024)
PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
by: Jing, Huihao, et al.
Published: (2026)
FormulaCode: Evaluating Agentic Optimization on Large Codebases
by: Sehgal, Atharva, et al.
Published: (2026)
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
by: Chen, Yeheng, et al.
Published: (2026)
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
by: Li, Wei, et al.
Published: (2025)
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
by: Zhuo, Terry Yue, et al.
Published: (2024)
From Completion to Editing: Unlocking Context-Aware Code Infilling via Search-and-Replace Instruction Tuning
by: Zhang, Jiajun, et al.
Published: (2026)
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
by: Chen, Zaoyu, et al.
Published: (2026)
A Code Comprehension Benchmark for Large Language Models for Code
by: Havare, Jayant, et al.
Published: (2025)
Towards an Understanding of Context Utilization in Code Intelligence
by: Wang, Yanlin, et al.
Published: (2025)