Staff View: Grounded World Model for Semantically Generalizable Planning :: Library Catalog

Bibliographic Details
Main Authors:	Li, Quanyi; Feng, Lan; Zhang, Haonan; Li, Wuyang; Wang, Letian; Alahi, Alexandre; Soh, Harold
Format:	Preprint
Published:	2026
Subjects:	Robotics; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.11751

_version_	1866914469671600128
author	Li, Quanyi; Feng, Lan; Zhang, Haonan; Li, Wuyang; Wang, Letian; Alahi, Alexandre; Soh, Harold
author_facet	Li, Quanyi; Feng, Lan; Zhang, Haonan; Li, Wuyang; Wang, Letian; Alahi, Alexandre; Soh, Harold
contents	In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder such as DINO or JEPA. However, it is challenging to obtain the goal image before task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, as reflected by the similarity of their embeddings. This approach transforms visuomotor MPC into a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves an 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_11751
institution	arXiv
publishDate	2026
record_format	arxiv
title	Grounded World Model for Semantically Generalizable Planning
topic	Robotics; Artificial Intelligence
url	https://arxiv.org/abs/2604.11751

Similar Items