Staff View: PAN: A World Model for General, Interactable, and Long-Horizon World Simulation :: Library Catalog

Bibliographic Details
Main Authors:	PAN Team; Xiang, Jiannan; Gu, Yi; Liu, Zihan; Feng, Zeyu; Gao, Qiyue; Hu, Yiyan; Huang, Benhao; Liu, Guangyi; Yang, Yichi; Zhou, Kun; Abrahamyan, Davit; Ahmad, Arif; Bannur, Ganesh; Chen, Junrong; Chen, Kimi; Deng, Mingkai; Han, Ruobing; Huang, Xinqi; Kang, Haoqiang; Liu, Zheqi; Ma, Enze; Ren, Hector; Shinde, Yashowardhan; Shingre, Rohan; Tanikella, Ramsundar; Tao, Kaiming; Yang, Dequan; Yu, Xinle; Zeng, Cong; Zhou, Binglin; Liu, Zhengzhong; Hu, Zhiting; Xing, Eric P.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2511.09057

_version_	1866909903103197184
author	PAN Team; Xiang, Jiannan; Gu, Yi; Liu, Zihan; Feng, Zeyu; Gao, Qiyue; Hu, Yiyan; Huang, Benhao; Liu, Guangyi; Yang, Yichi; Zhou, Kun; Abrahamyan, Davit; Ahmad, Arif; Bannur, Ganesh; Chen, Junrong; Chen, Kimi; Deng, Mingkai; Han, Ruobing; Huang, Xinqi; Kang, Haoqiang; Liu, Zheqi; Ma, Enze; Ren, Hector; Shinde, Yashowardhan; Shingre, Rohan; Tanikella, Ramsundar; Tao, Kaiming; Yang, Dequan; Yu, Xinle; Zeng, Cong; Zhou, Binglin; Liu, Zhengzhong; Hu, Zhiting; Xing, Eric P.
contents	A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in a prompt-to-full-video manner, without the causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture, which combines an autoregressive latent dynamics backbone based on a large language model (LLM) with a video diffusion decoder. The backbone grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions; the decoder reconstructs perceptually detailed and temporally coherent visual observations. Together they unify latent space reasoning (imagination) with realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_09057
institution	arXiv
publishDate	2025
record_format	arxiv
title	PAN: A World Model for General, Interactable, and Long-Horizon World Simulation
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning
url	https://arxiv.org/abs/2511.09057

Similar Items