| Main Authors: | Gramacki, Piotr, Martins, Bruno, Szymański, Piotr |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.04617 |
Similar Items
A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis
by: Martins, Bruno, et al.
Published: (2025)
FairCoder: Evaluating Social Bias of LLMs in Code Generation
by: Du, Yongkang, et al.
Published: (2025)
Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey
by: Wang, Junqiao, et al.
Published: (2024)
Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities
by: Wang, Hanbin, et al.
Published: (2025)
From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?
by: Ahmad, Wasi Uddin, et al.
Published: (2025)
Chain-of-Programming (CoP) : Empowering Large Language Models for Geospatial Code Generation
by: Hou, Shuyang, et al.
Published: (2024)
Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval
by: Wang, Jiexin, et al.
Published: (2024)
EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations
by: Li, Jia, et al.
Published: (2024)
CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation
by: Yan, Weixiang, et al.
Published: (2023)
AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
by: Wang, Zihan, et al.
Published: (2025)
Showing LLM-Generated Code Selectively Based on Confidence of LLMs
by: Li, Jia, et al.
Published: (2024)
Evaluating and Achieving Controllable Code Completion in Code LLM
by: Zhang, Jiajun, et al.
Published: (2026)
CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification
by: Tian, Yuchen, et al.
Published: (2024)
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
by: Chen, Zaoyu, et al.
Published: (2026)
OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs
by: Ahmad, Wasi Uddin, et al.
Published: (2025)
CodeJudge: Evaluating Code Generation with Large Language Models
by: Tong, Weixi, et al.
Published: (2024)
VersiCode: Towards Version-controllable Code Generation
by: Wu, Tongtong, et al.
Published: (2024)
Code Fingerprints: Disentangled Attribution of LLM-Generated Code
by: Guo, Jiaxun, et al.
Published: (2026)
CONCUR: Benchmarking LLMs for Concurrent Code Generation
by: Huang, Jue, et al.
Published: (2026)
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
by: Xia, Yunhui, et al.
Published: (2025)
Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?
by: Li, Xiangyang, et al.
Published: (2025)
CodeRAG-Bench: Can Retrieval Augment Code Generation?
by: Wang, Zora Zhiruo, et al.
Published: (2024)
PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
by: Jing, Huihao, et al.
Published: (2026)
LLMs for Science: Usage for Code Generation and Data Analysis
by: Nejjar, Mohamed, et al.
Published: (2023)
Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation
by: Moon, Jiwon, et al.
Published: (2025)
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
by: Chervyakov, Artem, et al.
Published: (2025)
CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
by: Xie, Yiqing, et al.
Published: (2024)
AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
by: Chou, Jason, et al.
Published: (2025)
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
by: Zhang, Chenchen, et al.
Published: (2025)
NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations
by: Chen, Junkai, et al.
Published: (2024)
Model Editing for LLMs4Code: How Far are We?
by: Li, Xiaopeng, et al.
Published: (2024)
Coffee: Boost Your Code LLMs by Fixing Bugs with Feedback
by: Moon, Seungjun, et al.
Published: (2023)
Comparing Developer and LLM Biases in Code Evaluation
by: Mittal, Aditya, et al.
Published: (2026)
Evaluating Language Models for Efficient Code Generation
by: Liu, Jiawei, et al.
Published: (2024)
Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback
by: Bi, Zhangqian, et al.
Published: (2024)
Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
by: Zhao, Songwen, et al.
Published: (2025)
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
by: Wang, Sizhe, et al.
Published: (2025)
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
by: Li, Wei, et al.
Published: (2025)
AutoGEEval: A Multimodal and Automated Framework for Geospatial Code Generation on GEE with Large Language Models
by: Hou, Shuyang, et al.
Published: (2025)
UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian
by: Syromiatnikov, Mykyta, et al.
Published: (2025)