Bibliographic Details
Main Authors: InSpatio Team, Shen, Donghui, Zhang, Guofeng, Liu, Haomin, Ji, Haoyu, Bao, Hujun, Zhai, Hongjia, Liu, Jialin, Guo, Jing, Wang, Nan, Pan, Siji, Pan, Weihong, Xie, Weijian, Liu, Xianbin, Xiang, Xiaojun, Zhang, Xiaoyu, Chen, Xinyu, Wang, Yifu, Chen, Yipeng, Fan, Zhenzhou, Le, Zhewen, Ye, Zhichao, Zhao, Ziqiang
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2604.07209
_version_ 1866918442361159680
author InSpatio Team
Shen, Donghui
Zhang, Guofeng
Liu, Haomin
Ji, Haoyu
Bao, Hujun
Zhai, Hongjia
Liu, Jialin
Guo, Jing
Wang, Nan
Pan, Siji
Pan, Weihong
Xie, Weijian
Liu, Xianbin
Xiang, Xiaojun
Zhang, Xiaoyu
Chen, Xinyu
Wang, Yifu
Chen, Yipeng
Fan, Zhenzhou
Le, Zhewen
Ye, Zhichao
Zhao, Ziqiang
contents Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: an Implicit Spatiotemporal Cache, which aggregates reference and historical observations into a latent world representation to ensure global consistency during long-horizon navigation, and an Explicit Spatial Constraint Module, which enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.
format Preprint
id arxiv_https___arxiv_org_abs_2604_07209
institution arXiv
publishDate 2026
record_format arxiv
title INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2604.07209