Staff View: GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents :: Library Catalog

Bibliographic Details
Main Authors:	Costarelli, Anthony; Allen, Mat; Hauksson, Roman; Sodunke, Grace; Hariharan, Suhas; Cheng, Carlson; Li, Wenjie; Clymer, Joshua; Yadav, Arjun
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2406.06613

_version_	1866914881156939776
author	Costarelli, Anthony; Allen, Mat; Hauksson, Roman; Sodunke, Grace; Hariharan, Suhas; Cheng, Carlson; Li, Wenjie; Clymer, Joshua; Yadav, Arjun
author_facet	Costarelli, Anthony; Allen, Mat; Hauksson, Roman; Sodunke, Grace; Hariharan, Suhas; Cheng, Carlson; Li, Wenjie; Clymer, Joshua; Yadav, Arjun
contents	Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, a comprehensive framework for evaluating agents' performance across the various types of reasoning found in games is lacking. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of LLM agents. We focus on 9 different game environments, where each covers at least one axis of key reasoning skill identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpora. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores, but not to levels comparable to human performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_06613
institution	arXiv
publishDate	2024
record_format	arxiv
title	GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2406.06613

Similar Items