Staff View: INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling :: Library Catalog

Bibliographic Details
Main Authors:	InSpatio Team; Shen, Donghui; Zhang, Guofeng; Liu, Haomin; Ji, Haoyu; Bao, Hujun; Zhai, Hongjia; Liu, Jialin; Guo, Jing; Wang, Nan; Pan, Siji; Pan, Weihong; Xie, Weijian; Liu, Xianbin; Xiang, Xiaojun; Zhang, Xiaoyu; Chen, Xinyu; Wang, Yifu; Chen, Yipeng; Fan, Zhenzhou; Le, Zhewen; Ye, Zhichao; Zhao, Ziqiang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.07209

_version_	1866918442361159680
author	InSpatio Team; Shen, Donghui; Zhang, Guofeng; Liu, Haomin; Ji, Haoyu; Bao, Hujun; Zhai, Hongjia; Liu, Jialin; Guo, Jing; Wang, Nan; Pan, Siji; Pan, Weihong; Xie, Weijian; Liu, Xianbin; Xiang, Xiaojun; Zhang, Xiaoyu; Chen, Xinyu; Wang, Yifu; Chen, Yipeng; Fan, Zhenzhou; Le, Zhewen; Ye, Zhichao; Zhao, Ziqiang
author_facet	InSpatio Team; Shen, Donghui; Zhang, Guofeng; Liu, Haomin; Ji, Haoyu; Bao, Hujun; Zhai, Hongjia; Liu, Jialin; Guo, Jing; Wang, Nan; Pan, Siji; Pan, Weihong; Xie, Weijian; Liu, Xianbin; Xiang, Xiaojun; Zhang, Xiaoyu; Chen, Xinyu; Wang, Yifu; Chen, Yipeng; Fan, Zhenzhou; Le, Zhewen; Ye, Zhichao; Zhao, Ziqiang
contents	Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_07209
institution	arXiv
publishDate	2026
record_format	arxiv
title	INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.07209

Similar Items