Staff View: BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors :: Library Catalog

Bibliographic Details
Main Authors:	Li, Lingfeng; Lu, Yunlong; Zhang, Yuefei; Yao, Jingyu; Zhu, Yixin; Cheng, KeYuan; Wang, Yongyi; Zheng, Qirui; Yang, Xionghui; Li, Wenxin
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.13214

_version_	1866912904149729280
author	Li, Lingfeng; Lu, Yunlong; Zhang, Yuefei; Yao, Jingyu; Zhu, Yixin; Cheng, KeYuan; Wang, Yongyi; Zheng, Qirui; Yang, Xionghui; Li, Wenxin
author_facet	Li, Lingfeng; Lu, Yunlong; Zhang, Yuefei; Yao, Jingyu; Zhu, Yixin; Cheng, KeYuan; Wang, Yongyi; Zheng, Qirui; Yang, Xionghui; Li, Wenxin
contents	Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability. Built on the Botzone platform's established competitive infrastructure, our BotzoneBench evaluates LLMs across eight diverse games spanning deterministic perfect-information board games to stochastic imperfect-information card games. Through systematic assessment of 177,047 state-action pairs from five flagship models, we reveal significant performance disparities and identify distinct strategic behaviors, with top-performing models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. This anchored evaluation paradigm generalizes beyond games to any domain with well-defined skill hierarchies, establishing a scalable and reusable framework for assessing interactive AI capabilities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_13214
institution	arXiv
publishDate	2026
record_format	arxiv
title	BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors
topic	Artificial Intelligence
url	https://arxiv.org/abs/2602.13214
