Staff View: Measuring Iterative Temporal Reasoning with Time Puzzles :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Zhengxiang; Dong, Zeyu
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.07148

_version_	1866908906419126272
author	Wang, Zhengxiang; Dong, Zeyu
author_facet	Wang, Zhengxiang; Dong, Zeyu
contents	Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations and may admit one or multiple valid dates. The puzzles are algorithmically generated, enabling controlled and continual evaluation. Across 13 LLMs, even the best model (GPT-5) achieves only 55.3% accuracy without tools, despite using easily searchable facts. While web search improves performance, models perform substantially better when constraints are rewritten with explicit dates, removing the need for factual lookup. These results reveal a gap in reliable tool use for iterative temporal reasoning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_07148
institution	arXiv
publishDate	2026
record_format	arxiv
title	Measuring Iterative Temporal Reasoning with Time Puzzles
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2601.07148
