Bibliographic Details
Main Authors: Ding, Pengxiang, Ma, Jianfei, Tong, Xinyang, Zou, Binghong, Luo, Xinxin, Fan, Yiguo, Wang, Ting, Lu, Hongchao, Mo, Panzhong, Liu, Jinxin, Wang, Yuefan, Zhou, Huaicheng, Feng, Wenshuo, Liu, Jiacheng, Huang, Siteng, Wang, Donglin
Format: Preprint
Published: 2025
Subjects: Robotics, Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2502.14795
author Ding, Pengxiang
Ma, Jianfei
Tong, Xinyang
Zou, Binghong
Luo, Xinxin
Fan, Yiguo
Wang, Ting
Lu, Hongchao
Mo, Panzhong
Liu, Jinxin
Wang, Yuefan
Zhou, Huaicheng
Feng, Wenshuo
Liu, Jiacheng
Huang, Siteng
Wang, Donglin
contents This paper addresses the limitations of current humanoid robot control frameworks, which primarily rely on reactive mechanisms and lack autonomous interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions, allowing the model to learn universal motion patterns and action semantics. We then incorporate egocentric visual context through parameter-efficient video-conditioned fine-tuning, enabling context-aware motion generation. Furthermore, we introduce a self-supervised data augmentation strategy that automatically generates pseudo-annotations derived directly from motion data. This process converts raw motion sequences into informative question-answer pairs, facilitating the effective use of large-scale unlabeled video data. Extensive experiments show that Humanoid-VLA, built upon whole-body control architectures, accomplishes object interaction and environment exploration tasks with enhanced contextual awareness, demonstrating a more human-like capacity for adaptive and intelligent engagement.
format Preprint
id arxiv_https___arxiv_org_abs_2502_14795
institution arXiv
publishDate 2025
record_format arxiv
title Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration
topic Robotics
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2502.14795