Staff View: BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution :: Library Catalog

Bibliographic Details
Main Authors:	Zhuo, Terry Yue; Jin, Xiaolong; Liu, Hange; Jiang, Juyong; Liu, Tianyang; Gong, Chen; Bishnoi, Bhupesh; Mishra, Vaisakhi; Suppa, Marek; Ziems, Noah; Utpala, Saiteja; Xu, Ming; Song, Guangyu; Li, Kaixin; Cao, Yuhan; Liu, Bo; Liu, Zheng; Abdurakhmanova, Sabina; Yu, Wenhao; Jia, Mengzhao; Yao, Jihan; Hamilton, Kenneth; Shridhar, Kumar; Vu, Minh Chien; Wang, Dingmin; Liu, Jiawei; Wang, Zijian; Liu, Qian; Hui, Binyuan; Risdal, Meg; Khaliq, Ahsen; Sood, Atin; Xing, Zhenchang; Ahmad, Wasi Uddin; Grundy, John; Lo, David; Zhu, Banghua; Du, Xiaoning; Scholak, Torsten; von Werra, Leandro
Format:	Preprint
Published:	2025
Subjects:	Software Engineering; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2510.08697

_version_	1866911324801335296
author	Zhuo, Terry Yue; Jin, Xiaolong; Liu, Hange; Jiang, Juyong; Liu, Tianyang; Gong, Chen; Bishnoi, Bhupesh; Mishra, Vaisakhi; Suppa, Marek; Ziems, Noah; Utpala, Saiteja; Xu, Ming; Song, Guangyu; Li, Kaixin; Cao, Yuhan; Liu, Bo; Liu, Zheng; Abdurakhmanova, Sabina; Yu, Wenhao; Jia, Mengzhao; Yao, Jihan; Hamilton, Kenneth; Shridhar, Kumar; Vu, Minh Chien; Wang, Dingmin; Liu, Jiawei; Wang, Zijian; Liu, Qian; Hui, Binyuan; Risdal, Meg; Khaliq, Ahsen; Sood, Atin; Xing, Zhenchang; Ahmad, Wasi Uddin; Grundy, John; Lo, David; Zhu, Banghua; Du, Xiaoning; Scholak, Torsten; von Werra, Leandro
author_facet	Zhuo, Terry Yue; Jin, Xiaolong; Liu, Hange; Jiang, Juyong; Liu, Tianyang; Gong, Chen; Bishnoi, Bhupesh; Mishra, Vaisakhi; Suppa, Marek; Ziems, Noah; Utpala, Saiteja; Xu, Ming; Song, Guangyu; Li, Kaixin; Cao, Yuhan; Liu, Bo; Liu, Zheng; Abdurakhmanova, Sabina; Yu, Wenhao; Jia, Mengzhao; Yao, Jihan; Hamilton, Kenneth; Shridhar, Kumar; Vu, Minh Chien; Wang, Dingmin; Liu, Jiawei; Wang, Zijian; Liu, Qian; Hui, Binyuan; Risdal, Meg; Khaliq, Ahsen; Sood, Atin; Xing, Zhenchang; Ahmad, Wasi Uddin; Grundy, John; Lo, David; Zhu, Banghua; Du, Xiaoning; Scholak, Torsten; von Werra, Leandro
contents	Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and deliberately simulating code execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with the execution process and outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data, namely BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs have superior performance in judging coding preferences when the execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs like GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recent emerging models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_08697
institution	arXiv
publishDate	2025
record_format	arxiv
title	BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
topic	Software Engineering; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2510.08697

Similar Items