Bibliographic Details
Main Authors: Mereu, Riccardo; Scannell, Aidan; Hou, Yuxin; Zhao, Yi; Jitta, Aditya; Dominguez, Antonio; Acerbi, Luigi; Storkey, Amos; Chang, Paul
Format: Preprint
Published: 2025
Subjects: Machine Learning; Artificial Intelligence; Robotics
Online Access: https://arxiv.org/abs/2510.07092
author Mereu, Riccardo
Scannell, Aidan
Hou, Yuxin
Zhao, Yi
Jitta, Aditya
Dominguez, Antonio
Acerbi, Luigi
Storkey, Amos
Chang, Paul
contents World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.
format Preprint
id arxiv_https___arxiv_org_abs_2510_07092
institution arXiv
publishDate 2025
record_format arxiv
title Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report
topic Machine Learning
Artificial Intelligence
Robotics
url https://arxiv.org/abs/2510.07092