Staff View: Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning :: Library Catalog

Bibliographic Details
Main Authors:	Pan, Zhikai; Liao, Chih-Ting; Liu, Chunrui; Xiao, Xi; Qiao, Yitong; Meng, Chunlei; Chen, Zhangquan; Cao, Xin
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.28277

_version_	1866913168066871296
author	Pan, Zhikai; Liao, Chih-Ting; Liu, Chunrui; Xiao, Xi; Qiao, Yitong; Meng, Chunlei; Chen, Zhangquan; Cao, Xin
author_facet	Pan, Zhikai; Liao, Chih-Ting; Liu, Chunrui; Xiao, Xi; Qiao, Yitong; Meng, Chunlei; Chen, Zhangquan; Cao, Xin
contents	Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_28277
institution	arXiv
publishDate	2026
record_format	arxiv
title	Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.28277

Similar Items