| Main Authors: | Chen, John; Cheng, Sihan; Gurkan, Can; Lin, Mingyi |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2604.07733 |
| _version_ | 1866911578118422528 |
|---|---|
| author | Chen, John; Cheng, Sihan; Gurkan, Can; Lin, Mingyi |
| author_facet | Chen, John; Cheng, Sihan; Gurkan, Can; Lin, Mingyi |
| contents | Evaluating strategic decision-making in LLM-based agents requires generative, competitive, and longitudinal environments, yet few benchmarks provide all three, and fewer still offer evaluation signals rich enough for long-horizon, multi-agent play. We introduce CivBench, a benchmark for LLM strategists (i.e., agentic setups) in multiplayer Civilization V. Because terminal win/loss is too sparse a signal in games spanning hundreds of turns and multiple opponents, CivBench trains models on turn-level game state to estimate victory probabilities throughout play, validated through predictive, construct, and convergent validity. Across 307 games with 7 LLMs and multiple CivBench agent conditions, we demonstrate CivBench's potential to estimate strategic capabilities as an unsaturated benchmark, reveal model-specific effects of agentic setup, and outline distinct strategic profiles not visible through outcome-only evaluation. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_07733 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V |
| topic | Artificial Intelligence |
| url | https://arxiv.org/abs/2604.07733 |