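| Main Authors: | Mereu, Riccardo; Scannell, Aidan; Hou, Yuxin; Zhao, Yi; Jitta, Aditya; Dominguez, Antonio; Acerbi, Luigi; Storkey, Amos; Chang, Paul |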
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Artificial Intelligence; Robotics |
| Online Access: | https://arxiv.org/abs/2510.07092 |
| author | Mereu, Riccardo; Scannell, Aidan; Hou, Yuxin; Zhao, Yi; Jitta, Aditya; Dominguez, Antonio; Acerbi, Luigi; Storkey, Amos; Chang, Paul |
| contents | World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2510_07092 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report |
| topic | Machine Learning; Artificial Intelligence; Robotics |
| url | https://arxiv.org/abs/2510.07092 |