Bibliographic Details
Main Authors: MotuBrain Team, Xiang, Chendong, Bao, Fan, Liu, Haitian, Tan, Hengkai, Bi, Hongzhe, Li, James, Liu, Jiabao, Pang, Jingrui, Jing, Kiro, Liu, Louis, Cai, Mengchen, Cui, Rongxu, Zhao, Ruowen, Wang, Runqing, Huang, Shuhe, Feng, Yao, Rong, Yinze, Wang, Zeyuan, Zhu, Jun
Format: Preprint
Published: 2026
Subjects: Robotics
Online Access: https://arxiv.org/abs/2604.27792
author MotuBrain Team
Xiang, Chendong
Bao, Fan
Liu, Haitian
Tan, Hengkai
Bi, Hongzhe
Li, James
Liu, Jiabao
Pang, Jingrui
Jing, Kiro
Liu, Louis
Cai, Mengchen
Cui, Rongxu
Zhao, Ruowen
Wang, Runqing
Huang, Shuhe
Feng, Yao
Rong, Yinze
Wang, Zeyuan
Zhu, Jun
contents Vision-Language-Action (VLA) models generalize semantically well but often lack fine-grained modeling of world dynamics. We present MotuBrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only, task-agnostic, and cross-embodiment robot data. Building on Motus, MotuBrain further introduces unified multiview modeling, an independent text stream for stronger language-action coupling, a shared cross-embodiment action representation, and an efficient post-training and deployment recipe for long-horizon real-world control. Our inference stack combines step reduction, compilation, FP8 quantization, DiT caching, V2A-style action-only inference, and real-time chunked closed-loop execution, achieving over 50x speedup over a naive baseline and up to 11 Hz inference. Experimentally, MotuBrain achieves 95.8% and 96.1% average success on RoboTwin 2.0 under clean and randomized settings, respectively, attains the strongest reported EWMScore in our WorldArena comparison, and adapts to new humanoid embodiments with only 50–100 trajectories. These results show that unified world action models can scale in generality, predictive accuracy, and real-world deployability.
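The abstract says a single model covers policy learning, world modeling, video generation, inverse dynamics, and joint prediction under a UniDiffuser formulation. The following is a minimal sketch of the general UniDiffuser idea only, not the authors' implementation: each modality gets its own diffusion timestep, and the choice of timesteps selects the task. The function name `modality_timesteps`, the mode names, the normalized time range, and the mapping of "policy" to a fully noised video stream are all assumptions made for illustration.

```python
from typing import Tuple

# Assumed convention: time is normalized to [0, 1].
T_NOISE = 1.0  # modality is pure noise, so it is effectively marginalized out
T_CLEAN = 0.0  # modality is supplied clean, so it acts as conditioning


def modality_timesteps(mode: str, t: float) -> Tuple[float, float]:
    """Return (t_video, t_action) for one denoising step at time t.

    Hypothetical helper: it only illustrates how per-modality timesteps
    could select among the tasks listed in the abstract.
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    table = {
        # Both streams are denoised together.
        "joint": (t, t),
        # Actions are denoised; future video is left as noise (assumption).
        "policy": (T_NOISE, t),
        # Video is denoised; actions are left as noise.
        "video_generation": (t, T_NOISE),
        # Video is denoised given clean actions.
        "world_model": (t, T_CLEAN),
        # Actions are denoised given clean video.
        "inverse_dynamics": (T_CLEAN, t),
    }
    if mode not in table:
        raise ValueError(f"unknown mode: {mode!r}")
    return table[mode]


if __name__ == "__main__":
    for name in ("joint", "policy", "video_generation",
                 "world_model", "inverse_dynamics"):
        print(name, modality_timesteps(name, 0.5))
```

In this sketch the same denoiser would be called in every mode; only the pair of timesteps passed to it changes.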
format Preprint
id arxiv_https___arxiv_org_abs_2604_27792
institution arXiv
publishDate 2026
record_format arxiv
title MotuBrain: An Advanced World Action Model for Robot Control
topic Robotics
url https://arxiv.org/abs/2604.27792