Bibliographic Details
Main Authors: GigaBrain Team; Wang, Boyuan; Li, Bohan; Ni, Chaojun; Huang, Guan; Zhao, Guosheng; Li, Hao; Li, Jie; Lv, Jindi; Liu, Jingyu; Feng, Lv; Yu, Mingming; Li, Peng; Deng, Qiuping; Liu, Tianze; Zhou, Xinyu; Chen, Xinze; Wang, Xiaofeng; Wang, Yang; Li, Yifan; Nie, Yifei; Li, Yilong; Zhou, Yukun; Ye, Yun; Liu, Zhichao; Zhu, Zheng
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.12099
_version_ 1866917294723039232
author GigaBrain Team
Wang, Boyuan
Li, Bohan
Ni, Chaojun
Huang, Guan
Zhao, Guosheng
Li, Hao
Li, Jie
Lv, Jindi
Liu, Jingyu
Feng, Lv
Yu, Mingming
Li, Peng
Deng, Qiuping
Liu, Tianze
Zhou, Xinyu
Chen, Xinze
Wang, Xiaofeng
Wang, Yang
Li, Yifan
Nie, Yifei
Li, Yilong
Zhou, Yukun
Ye, Yun
Liu, Zhichao
Zhu, Zheng
contents Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose \textit{GigaBrain-0.5M*}, a VLA model trained via world model-based reinforcement learning. It is built upon \textit{GigaBrain-0.5}, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. \textit{GigaBrain-0.5M*} further integrates world model-based reinforcement learning via \textit{RAMP} (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that \textit{RAMP} achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30\% on challenging tasks including \texttt{Laundry Folding}, \texttt{Box Packing}, and \texttt{Espresso Preparation}. Critically, \textit{GigaBrain-0.5M*} exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our \href{https://gigabrain05m.github.io}{project page}.
format Preprint
id arxiv_https___arxiv_org_abs_2602_12099
institution arXiv
publishDate 2026
record_format arxiv
title GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.12099