Bibliographic Details
Main authors: Shi, Jiajun; Yang, Jian; Liu, Jiaheng; Bu, Xingyuan; Chen, Jiangjie; Zhou, Junting; Ma, Kaijing; Wen, Zhoufutu; Wang, Bingli; He, Yancheng; Song, Liang; Zhu, Hualei; Li, Shilong; Wang, Xingjian; Zhang, Wei; Yuan, Ruibin; Yao, Yifan; Yang, Wenjun; Wang, Yunli; Fang, Siyuan; Yuan, Siyu; He, Qianyu; Tang, Xiangru; Tan, Yingshui; Zhou, Wangchunshu; Zhang, Zhaoxiang; Li, Zhoujun; Huang, Wenhao; Zhang, Ge
Format: Preprint
Published: 2025
Subjects: Computation and Language; Artificial Intelligence; Machine Learning
Online access: https://arxiv.org/abs/2505.14552
author Shi, Jiajun
Yang, Jian
Liu, Jiaheng
Bu, Xingyuan
Chen, Jiangjie
Zhou, Junting
Ma, Kaijing
Wen, Zhoufutu
Wang, Bingli
He, Yancheng
Song, Liang
Zhu, Hualei
Li, Shilong
Wang, Xingjian
Zhang, Wei
Yuan, Ruibin
Yao, Yifan
Yang, Wenjun
Wang, Yunli
Fang, Siyuan
Yuan, Siyu
He, Qianyu
Tang, Xiangru
Tan, Yingshui
Zhou, Wangchunshu
Zhang, Zhaoxiang
Li, Zhoujun
Huang, Wenhao
Zhang, Ge
contents Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.
format Preprint
id arxiv_https___arxiv_org_abs_2505_14552
institution arXiv
publishDate 2025
record_format arxiv
title KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
topic Computation and Language
Artificial Intelligence
Machine Learning
url https://arxiv.org/abs/2505.14552