MARC21 :: Library Catalog

Bibliographic Details
Main Authors:	GigaBrain Team; Wang, Boyuan; Li, Bohan; Ni, Chaojun; Huang, Guan; Zhao, Guosheng; Li, Hao; Li, Jie; Lv, Jindi; Liu, Jingyu; Feng, Lv; Yu, Mingming; Li, Peng; Deng, Qiuping; Liu, Tianze; Zhou, Xinyu; Chen, Xinze; Wang, Xiaofeng; Wang, Yang; Li, Yifan; Nie, Yifei; Li, Yilong; Zhou, Yukun; Ye, Yun; Liu, Zhichao; Zhu, Zheng
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.12099

_version_	1866917294723039232
author	GigaBrain Team; Wang, Boyuan; Li, Bohan; Ni, Chaojun; Huang, Guan; Zhao, Guosheng; Li, Hao; Li, Jie; Lv, Jindi; Liu, Jingyu; Feng, Lv; Yu, Mingming; Li, Peng; Deng, Qiuping; Liu, Tianze; Zhou, Xinyu; Chen, Xinze; Wang, Xiaofeng; Wang, Yang; Li, Yifan; Nie, Yifei; Li, Yilong; Zhou, Yukun; Ye, Yun; Liu, Zhichao; Zhu, Zheng
author_facet	GigaBrain Team; Wang, Boyuan; Li, Bohan; Ni, Chaojun; Huang, Guan; Zhao, Guosheng; Li, Hao; Li, Jie; Lv, Jindi; Liu, Jingyu; Feng, Lv; Yu, Mingming; Li, Peng; Deng, Qiuping; Liu, Tianze; Zhou, Xinyu; Chen, Xinze; Wang, Xiaofeng; Wang, Yang; Li, Yifan; Nie, Yifei; Li, Yilong; Zhou, Yukun; Ye, Yun; Liu, Zhichao; Zhu, Zheng
contents	Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose \textit{GigaBrain-0.5M$^*$}, a VLA model trained via world model-based reinforcement learning. It builds upon \textit{GigaBrain-0.5}, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. \textit{GigaBrain-0.5M$^*$} further integrates world model-based reinforcement learning via \textit{RAMP} (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that \textit{RAMP} achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30\% on challenging tasks including \texttt{Laundry Folding}, \texttt{Box Packing}, and \texttt{Espresso Preparation}. Critically, \textit{GigaBrain-0.5M$^*$} exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our \href{https://gigabrain05m.github.io}{project page}.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_12099
institution	arXiv
publishDate	2026
record_format	arxiv
title	GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2602.12099

Similar Items