Staff View: WHODUNIT: Evaluation benchmark for culprit detection in mystery stories :: Library Catalog

Bibliographic Details
Main Author:	Gupta, Kshitij
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.07747

_version_	1866913686286761984
author	Gupta, Kshitij
author_facet	Gupta, Kshitij
contents	We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open-domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, evaluated through multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. The dataset is publicly available.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_07747
institution	arXiv
publishDate	2025
record_format	arxiv
title	WHODUNIT: Evaluation benchmark for culprit detection in mystery stories
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2502.07747

Similar Items