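| Main Authors: | MotuBrain Team; Xiang, Chendong; Bao, Fan; Liu, Haitian; Tan, Hengkai; Bi, Hongzhe; Li, James; Liu, Jiabao; Pang, Jingrui; Jing, Kiro; Liu, Louis; Cai, Mengchen; Cui, Rongxu; Zhao, Ruowen; Wang, Runqing; Huang, Shuhe; Feng, Yao; Rong, Yinze; Wang, Zeyuan; Zhu, Jun |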
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Robotics |
| Online Access: | https://arxiv.org/abs/2604.27792 |
| _version_ | 1866913079358390272 |
|---|---|
| author | MotuBrain Team; Xiang, Chendong; Bao, Fan; Liu, Haitian; Tan, Hengkai; Bi, Hongzhe; Li, James; Liu, Jiabao; Pang, Jingrui; Jing, Kiro; Liu, Louis; Cai, Mengchen; Cui, Rongxu; Zhao, Ruowen; Wang, Runqing; Huang, Shuhe; Feng, Yao; Rong, Yinze; Wang, Zeyuan; Zhu, Jun |
| author_facet | MotuBrain Team; Xiang, Chendong; Bao, Fan; Liu, Haitian; Tan, Hengkai; Bi, Hongzhe; Li, James; Liu, Jiabao; Pang, Jingrui; Jing, Kiro; Liu, Louis; Cai, Mengchen; Cui, Rongxu; Zhao, Ruowen; Wang, Runqing; Huang, Shuhe; Feng, Yao; Rong, Yinze; Wang, Zeyuan; Zhu, Jun |
| contents | Vision-Language-Action (VLA) models generalize semantically well but often lack fine-grained modeling of world dynamics. We present MotuBrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only, task-agnostic, and cross-embodiment robot data. Building on Motus, MotuBrain further introduces unified multiview modeling, an independent text stream for stronger language-action coupling, a shared cross-embodiment action representation, and an efficient post-training and deployment recipe for long-horizon real-world control. Our inference stack combines step reduction, compilation, FP8 quantization, DiT caching, V2A-style action-only inference, and real-time chunked closed-loop execution, achieving over 50x speedup over a naive baseline and up to 11 Hz inference. Experimentally, MotuBrain achieves 95.8% and 96.1% average success on RoboTwin 2.0 under clean and randomized settings, respectively, attains the strongest reported EWMScore in our WorldArena comparison, and adapts to new humanoid embodiments with only 50 to 100 trajectories. These results show that unified world action models can scale in generality, predictive accuracy, and real-world deployability. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_27792 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | MotuBrain: An Advanced World Action Model for Robot Control |
| topic | Robotics |
| url | https://arxiv.org/abs/2604.27792 |