MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Shi, Jiajun; Yang, Jian; Liu, Jiaheng; Bu, Xingyuan; Chen, Jiangjie; Zhou, Junting; Ma, Kaijing; Wen, Zhoufutu; Wang, Bingli; He, Yancheng; Song, Liang; Zhu, Hualei; Li, Shilong; Wang, Xingjian; Zhang, Wei; Yuan, Ruibin; Yao, Yifan; Yang, Wenjun; Wang, Yunli; Fang, Siyuan; Yuan, Siyu; He, Qianyu; Tang, Xiangru; Tan, Yingshui; Zhou, Wangchunshu; Zhang, Zhaoxiang; Li, Zhoujun; Huang, Wenhao; Zhang, Ge
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2505.14552

_version_	1866913849443090432
author	Shi, Jiajun; Yang, Jian; Liu, Jiaheng; Bu, Xingyuan; Chen, Jiangjie; Zhou, Junting; Ma, Kaijing; Wen, Zhoufutu; Wang, Bingli; He, Yancheng; Song, Liang; Zhu, Hualei; Li, Shilong; Wang, Xingjian; Zhang, Wei; Yuan, Ruibin; Yao, Yifan; Yang, Wenjun; Wang, Yunli; Fang, Siyuan; Yuan, Siyu; He, Qianyu; Tang, Xiangru; Tan, Yingshui; Zhou, Wangchunshu; Zhang, Zhaoxiang; Li, Zhoujun; Huang, Wenhao; Zhang, Ge
author_facet	Shi, Jiajun; Yang, Jian; Liu, Jiaheng; Bu, Xingyuan; Chen, Jiangjie; Zhou, Junting; Ma, Kaijing; Wen, Zhoufutu; Wang, Bingli; He, Yancheng; Song, Liang; Zhu, Hualei; Li, Shilong; Wang, Xingjian; Zhang, Wei; Yuan, Ruibin; Yao, Yifan; Yang, Wenjun; Wang, Yunli; Fang, Siyuan; Yuan, Siyu; He, Qianyu; Tang, Xiangru; Tan, Yingshui; Zhou, Wangchunshu; Zhang, Zhaoxiang; Li, Zhoujun; Huang, Wenhao; Zhang, Ge
contents	Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_14552
institution	arXiv
publishDate	2025
record_format	arxiv
title	KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
topic	Computation and Language; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2505.14552
