Table of Contents: :: Library Catalog

Bibliographic Details
Main Authors:	Lian, Shijie; Yu, Bin; Lin, Xiaopeng; Wu, Changti; Yuan, Hang; Hu, Xiaolin; Shen, Zhaolong; Miao, Yuzhuo; Liu, Haishan; Tian, Yuxuan; Shi, Yukun; Huang, Cong; Chen, Kai
Format:	Preprint
Published:	2026
Subjects:	Robotics; Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.15298

Table of Contents:

Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.