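| Main Authors: | Li, Lingfeng; Lu, Yunlong; Zhang, Yuefei; Yao, Jingyu; Zhu, Yixin; Cheng, KeYuan; Wang, Yongyi; Zheng, Qirui; Yang, Xionghui; Li, Wenxin |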
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2602.13214 |
| _version_ | 1866912904149729280 |
|---|---|
| author | Li, Lingfeng; Lu, Yunlong; Zhang, Yuefei; Yao, Jingyu; Zhu, Yixin; Cheng, KeYuan; Wang, Yongyi; Zheng, Qirui; Yang, Xionghui; Li, Wenxin |
| author_facet | Li, Lingfeng; Lu, Yunlong; Zhang, Yuefei; Yao, Jingyu; Zhu, Yixin; Cheng, KeYuan; Wang, Yongyi; Zheng, Qirui; Yang, Xionghui; Li, Wenxin |
| contents | Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability. Built on the Botzone platform's established competitive infrastructure, our BotzoneBench evaluates LLMs across eight diverse games spanning deterministic perfect-information board games to stochastic imperfect-information card games. Through systematic assessment of 177,047 state-action pairs from five flagship models, we reveal significant performance disparities and identify distinct strategic behaviors, with top-performing models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. This anchored evaluation paradigm generalizes beyond games to any domain with well-defined skill hierarchies, establishing a scalable and reusable framework for assessing interactive AI capabilities. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_13214 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors |
| topic | Artificial Intelligence |
| url | https://arxiv.org/abs/2602.13214 |
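
The abstract contrasts the quadratic cost of LLM-vs-LLM tournaments with the linear cost of evaluating against fixed anchors. The arithmetic behind that contrast can be sketched as follows. This is an illustrative count of pairings only, not the paper's evaluation code; the function names and the anchor count of 4 are assumptions chosen for the example.

```python
def round_robin_matchups(n_models: int) -> int:
    """Pairings needed when every model plays every other model once."""
    return n_models * (n_models - 1) // 2


def anchored_matchups(n_models: int, n_anchors: int) -> int:
    """Pairings needed when each model plays only a fixed set of anchor AIs."""
    return n_models * n_anchors


if __name__ == "__main__":
    # n_anchors=4 is a hypothetical hierarchy size, not a figure from the paper.
    for n in (5, 50, 500):
        print(n, round_robin_matchups(n), anchored_matchups(n, n_anchors=4))
```

Under these assumptions, adding one more model to a pool of `n` requires `n` new pairings in a round-robin but only `n_anchors` new pairings against the anchors, and earlier results against the anchors stay valid because the anchors do not change.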