Bibliographic Details
Main Authors: Wang, Zhengxiang; Dong, Zeyu
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2601.07148
contents Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations and may admit one or multiple valid dates. The puzzles are algorithmically generated, enabling controlled and continual evaluation. Across 13 LLMs, even the best model (GPT-5) achieves only 55.3% accuracy without tools, despite using easily searchable facts. While web search improves performance, models perform substantially better when constraints are rewritten with explicit dates, removing the need for factual lookup. These results reveal a gap in reliable tool use for iterative temporal reasoning.
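To make the idea of constraint-based date inference concrete, here is a minimal sketch of how a puzzle with one or several valid dates could be solved by brute-force enumeration. It is an illustration only, not the authors' generator or evaluation code: the `solve` helper, the three example constraints, and the 2000-2010 search window are all assumptions chosen for this example, and the record does not describe how the actual puzzles are built or checked.

```python
from datetime import date, timedelta
import calendar


def solve(constraints, start, end):
    """Return every date in [start, end] that satisfies all constraints.

    `constraints` is a list of predicates taking a `date` and returning a bool.
    This helper is hypothetical and only illustrates the task format.
    """
    solutions = []
    day = start
    while day <= end:
        if all(check(day) for check in constraints):
            solutions.append(day)
        day += timedelta(days=1)
    return solutions


# Hypothetical puzzle: "a Friday the 13th that falls in a leap year".
constraints = [
    lambda d: d.weekday() == 4,         # Monday is 0, so Friday is 4
    lambda d: d.day == 13,              # day of month is the 13th
    lambda d: calendar.isleap(d.year),  # Gregorian leap year
]

solutions = solve(constraints, date(2000, 1, 1), date(2010, 12, 31))
for d in solutions:
    print(d.isoformat())
```

Because the search returns a list, the same routine covers puzzles with a single answer and puzzles that admit multiple valid dates. A puzzle of the kind the abstract describes would additionally replace some explicit constraints with factual anchors that must be looked up first, which this sketch does not attempt.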
format Preprint
id arxiv_https___arxiv_org_abs_2601_07148
institution arXiv
publishDate 2026
record_format arxiv
title Measuring Iterative Temporal Reasoning with Time Puzzles
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2601.07148