Bibliographic Details
Main Authors: Li, Zehan, Wang, Yuxuan, Lahib, Ali El, Xia, Ying-Jieh, Pi, Xinyu
Format: Preprint
Published: 2026
Subjects: Computation and Language; Artificial Intelligence
Online Access: https://arxiv.org/abs/2601.13717
author Li, Zehan
Wang, Yuxuan
Lahib, Ali El
Xia, Ying-Jieh
Pi, Xinyu
contents Evaluating LLM forecasting capabilities is constrained by a fundamental tension: prospective evaluation offers methodological rigor but prohibitive latency, while retrospective forecasting (RF) -- evaluating on already-resolved events -- faces rapidly shrinking clean evaluation data as SOTA models possess increasingly recent knowledge cutoffs. Simulated Ignorance (SI), prompting models to suppress pre-cutoff knowledge, has emerged as a potential solution. We provide the first systematic test of whether SI can approximate True Ignorance (TI). Across 477 competition-level questions and 9 models, we find that SI fails systematically: (1) cutoff instructions leave a 52% performance gap between SI and TI; (2) chain-of-thought reasoning fails to suppress prior knowledge, even when reasoning traces contain no explicit post-cutoff references; (3) reasoning-optimized models exhibit worse SI fidelity despite superior reasoning trace quality. These findings demonstrate that prompts cannot reliably "rewind" model knowledge. We conclude that RF on pre-cutoff events is methodologically flawed; we recommend against using SI-based retrospective setups to benchmark forecasting capabilities.
format Preprint
id arxiv_https___arxiv_org_abs_2601_13717
institution arXiv
publishDate 2026
record_format arxiv
title Simulated Ignorance Fails: A Systematic Study of LLM Behaviors on Forecasting Problems Before Model Knowledge Cutoff
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2601.13717