Bibliographic Details
Main Authors: Li, Lingfeng; Lu, Yunlong; Zhang, Yuefei; Yao, Jingyu; Zhu, Yixin; Cheng, KeYuan; Wang, Yongyi; Zheng, Qirui; Yang, Xionghui; Li, Wenxin
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.13214
author Li, Lingfeng
Lu, Yunlong
Zhang, Yuefei
Yao, Jingyu
Zhu, Yixin
Cheng, KeYuan
Wang, Yongyi
Zheng, Qirui
Yang, Xionghui
Li, Wenxin
contents Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability. Built on the Botzone platform's established competitive infrastructure, our BotzoneBench evaluates LLMs across eight diverse games spanning deterministic perfect-information board games to stochastic imperfect-information card games. Through systematic assessment of 177,047 state-action pairs from five flagship models, we reveal significant performance disparities and identify distinct strategic behaviors, with top-performing models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. This anchored evaluation paradigm generalizes beyond games to any domain with well-defined skill hierarchies, establishing a scalable and reusable framework for assessing interactive AI capabilities.
format Preprint
id arxiv_https___arxiv_org_abs_2602_13214
institution arXiv
publishDate 2026
record_format arxiv
title BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors
topic Artificial Intelligence
url https://arxiv.org/abs/2602.13214
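
The abstract contrasts the quadratic cost of LLM-vs-LLM tournaments with the linear cost of evaluating against a fixed set of anchors. The sketch below only illustrates that counting argument. It is not taken from the paper: the function names are hypothetical, and the number of anchor bots (4 in the usage note) is an assumed value, since this record does not state how many skill tiers BotzoneBench uses.

```python
def round_robin_pairings(num_models: int) -> int:
    """Distinct pairings in an all-play-all LLM-vs-LLM tournament: n*(n-1)/2."""
    return num_models * (num_models - 1) // 2


def anchored_pairings(num_models: int, num_anchors: int) -> int:
    """Pairings when every model plays only a fixed set of anchor bots: n*k.

    num_anchors is an assumed parameter; the record does not give the
    actual number of graded anchors.
    """
    return num_models * num_anchors


def extra_pairings_for_new_model(num_models: int, num_anchors: int) -> tuple[int, int]:
    """Additional pairings when one more model joins: (tournament, anchored).

    A tournament must pair the newcomer with every existing model, while
    the anchored scheme only pairs it with the fixed anchors.
    """
    return num_models, num_anchors
```

With an assumed 4 anchors, 5 models need 10 tournament pairings versus 20 anchored ones, but 50 models need 1225 versus 200. Adding a 51st model costs 50 new tournament pairings but only 4 anchored ones, and earlier anchored results stay valid because the anchors do not change.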