| Main Author: | Judy, Bryce |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2412.14179 |
Similar Items
Multimodal Approach for Harmonized System Code Prediction
by: Amel, Otmane, et al.
Published: (2024)
Towards Precise Observations of Neural Model Robustness in Classification
by: Mu, Wenchuan, et al.
Published: (2024)
Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality
by: Koohestani, Roham, et al.
Published: (2025)
Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs
by: Chen, Yujia, et al.
Published: (2026)
Beyond Retrieval: A Multitask Benchmark and Model for Code Search
by: Xue, Siqiao, et al.
Published: (2026)
From Charts to Code: A Hierarchical Benchmark for Multimodal Models
by: Tang, Jiahao, et al.
Published: (2025)
Conventional Commit Classification using Large Language Models and Prompt Engineering
by: Quadir, H. M. Sazzad, et al.
Published: (2026)
World of Workflows: A Benchmark for Bringing World Models to Enterprise Systems
by: Gupta, Lakshya, et al.
Published: (2026)
EmbedAgent: Benchmarking Large Language Models in Embedded System Development
by: Xu, Ruiyang, et al.
Published: (2025)
Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems
by: Ouyang, Yipeng, et al.
Published: (2026)
Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification
by: Nguyen, Linh, et al.
Published: (2025)
Insights from Benchmarking Frontier Language Models on Web App Code Generation
by: Cui, Yi
Published: (2024)
SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation
by: He, Yibo, et al.
Published: (2025)
How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study
by: Velasco, Alejandro, et al.
Published: (2024)
COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models
by: Meaden, James, et al.
Published: (2025)
AdaptEval: A Benchmark for Evaluating Large Language Models on Code Snippet Adaptation
by: Zhang, Tanghaoran, et al.
Published: (2026)
CABENCH: Benchmarking Composable AI for Solving Complex Tasks through Composing Ready-to-Use Models
by: Pham, Tung-Thuy, et al.
Published: (2025)
Code Review Agent Benchmark
by: Zhang, Yuntong, et al.
Published: (2026)
CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models
by: Zhang, Yifan, et al.
Published: (2025)
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
by: Qiu, Jielin, et al.
Published: (2025)
Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
by: Naser, MZ, et al.
Published: (2026)
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation
by: Yin, Wenjing, et al.
Published: (2025)
Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents
by: Wang, Kaixin, et al.
Published: (2025)
Benchmarking Mythos-Linked Bug Rediscovery
by: David, Isaac, et al.
Published: (2026)
Energy-Aware Code Generation with LLMs: Benchmarking Small vs. Large Language Models for Sustainable AI Programming
by: Ashraf, Humza, et al.
Published: (2025)
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
by: Arimbur, Johin Johny
Published: (2026)
CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models
by: Padwal, Vedant
Published: (2026)
Benchmarking Correctness and Security in Multi-Turn Code Generation
by: Rawal, Ruchit, et al.
Published: (2025)
Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering
by: Rodriguez-Cardenas, Daniel, et al.
Published: (2026)
Lyra: A Benchmark for Turducken-Style Code Generation
by: Liang, Qingyuan, et al.
Published: (2021)
CodeClash: Benchmarking Goal-Oriented Software Engineering
by: Yang, John, et al.
Published: (2025)
Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit
by: Wan, Yao, et al.
Published: (2023)
Automated Benchmark Generation for Repository-Level Coding Tasks
by: Vergopoulos, Konstantinos, et al.
Published: (2025)
ReXCL: A Tool for Requirement Document Extraction and Classification
by: Bhattacharya, Paheli, et al.
Published: (2025)
AuthAttLyzer-V2: Unveiling Code Authorship Attribution using Enhanced Ensemble Learning Models & Generating Benchmark Dataset
by: Joshi, Bhaskar, et al.
Published: (2024)
Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review
by: Paul, Debalina Ghosh, et al.
Published: (2024)
DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation
by: Zhu, Qiming, et al.
Published: (2024)
Re-Evaluating Code LLM Benchmarks Under Semantic Mutation
by: Pan, Zhiyuan, et al.
Published: (2025)
A New Benchmark for the Appropriate Evaluation of RTL Code Optimization
by: Lu, Yao, et al.
Published: (2026)
Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
by: Ebrahimi, Amir M., et al.
Published: (2026)