Staff View: MotuBrain: An Advanced World Action Model for Robot Control :: Library Catalog

Bibliographic Details
Main Authors:	MotuBrain Team; Xiang, Chendong; Bao, Fan; Liu, Haitian; Tan, Hengkai; Bi, Hongzhe; Li, James; Liu, Jiabao; Pang, Jingrui; Jing, Kiro; Liu, Louis; Cai, Mengchen; Cui, Rongxu; Zhao, Ruowen; Wang, Runqing; Huang, Shuhe; Feng, Yao; Rong, Yinze; Wang, Zeyuan; Zhu, Jun
Format:	Preprint
Published:	2026
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2604.27792

_version_	1866913079358390272
author	MotuBrain Team; Xiang, Chendong; Bao, Fan; Liu, Haitian; Tan, Hengkai; Bi, Hongzhe; Li, James; Liu, Jiabao; Pang, Jingrui; Jing, Kiro; Liu, Louis; Cai, Mengchen; Cui, Rongxu; Zhao, Ruowen; Wang, Runqing; Huang, Shuhe; Feng, Yao; Rong, Yinze; Wang, Zeyuan; Zhu, Jun
author_facet	MotuBrain Team; Xiang, Chendong; Bao, Fan; Liu, Haitian; Tan, Hengkai; Bi, Hongzhe; Li, James; Liu, Jiabao; Pang, Jingrui; Jing, Kiro; Liu, Louis; Cai, Mengchen; Cui, Rongxu; Zhao, Ruowen; Wang, Runqing; Huang, Shuhe; Feng, Yao; Rong, Yinze; Wang, Zeyuan; Zhu, Jun
contents	Vision-Language-Action (VLA) models generalize well semantically but often lack fine-grained modeling of world dynamics. We present MotuBrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only, task-agnostic, and cross-embodiment robot data. Building on Motus, MotuBrain further introduces unified multiview modeling, an independent text stream for stronger language-action coupling, a shared cross-embodiment action representation, and an efficient post-training and deployment recipe for long-horizon real-world control. Our inference stack combines step reduction, compilation, FP8 quantization, DiT caching, V2A-style action-only inference, and real-time chunked closed-loop execution, achieving a speedup of more than 50x over a naive baseline and up to 11 Hz inference. Experimentally, MotuBrain achieves 95.8% and 96.1% average success on RoboTwin 2.0 under clean and randomized settings, respectively, attains the strongest reported EWMScore in our WorldArena comparison, and adapts to new humanoid embodiments with only 50 to 100 trajectories. These results show that unified world action models can scale in generality, predictive accuracy, and real-world deployability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_27792
institution	arXiv
publishDate	2026
record_format	arxiv
title	MotuBrain: An Advanced World Action Model for Robot Control
topic	Robotics
url	https://arxiv.org/abs/2604.27792
