Bibliographic Details
Main Authors: Chen, John; Cheng, Sihan; Gurkan, Can; Lin, Mingyi
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2604.07733
author Chen, John
Cheng, Sihan
Gurkan, Can
Lin, Mingyi
contents Evaluating strategic decision-making in LLM-based agents requires generative, competitive, and longitudinal environments, yet few benchmarks provide all three, and fewer still offer evaluation signals rich enough for long-horizon, multi-agent play. We introduce CivBench, a benchmark for LLM strategists (i.e., agentic setups) in multiplayer Civilization V. Because terminal win/loss is too sparse a signal in games spanning hundreds of turns and multiple opponents, CivBench trains models on turn-level game state to estimate victory probabilities throughout play, validated through predictive, construct, and convergent validity. Across 307 games with 7 LLMs and multiple CivBench agent conditions, we demonstrate CivBench's potential to estimate strategic capabilities as an unsaturated benchmark, reveal model-specific effects of agentic setup, and outline distinct strategic profiles not visible through outcome-only evaluation.
format Preprint
id arxiv_https___arxiv_org_abs_2604_07733
institution arXiv
publishDate 2026
record_format arxiv
title CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
topic Artificial Intelligence
url https://arxiv.org/abs/2604.07733