Staff View: Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration :: Library Catalog

Bibliographic Details
Main Authors:	Ding, Pengxiang; Ma, Jianfei; Tong, Xinyang; Zou, Binghong; Luo, Xinxin; Fan, Yiguo; Wang, Ting; Lu, Hongchao; Mo, Panzhong; Liu, Jinxin; Wang, Yuefan; Zhou, Huaicheng; Feng, Wenshuo; Liu, Jiacheng; Huang, Siteng; Wang, Donglin
Format:	Preprint
Published:	2025
Subjects:	Robotics; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.14795

_version_	1866913701069586432
author	Ding, Pengxiang; Ma, Jianfei; Tong, Xinyang; Zou, Binghong; Luo, Xinxin; Fan, Yiguo; Wang, Ting; Lu, Hongchao; Mo, Panzhong; Liu, Jinxin; Wang, Yuefan; Zhou, Huaicheng; Feng, Wenshuo; Liu, Jiacheng; Huang, Siteng; Wang, Donglin
author_facet	Ding, Pengxiang; Ma, Jianfei; Tong, Xinyang; Zou, Binghong; Luo, Xinxin; Fan, Yiguo; Wang, Ting; Lu, Hongchao; Mo, Panzhong; Liu, Jinxin; Wang, Yuefan; Zhou, Huaicheng; Feng, Wenshuo; Liu, Jiacheng; Huang, Siteng; Wang, Donglin
contents	This paper addresses the limitations of current humanoid robot control frameworks, which rely primarily on reactive mechanisms and lack autonomous interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions, allowing the model to learn universal motion patterns and action semantics. We then incorporate egocentric visual context through parameter-efficient video-conditioned fine-tuning, enabling context-aware motion generation. Furthermore, we introduce a self-supervised data augmentation strategy that automatically generates pseudo-annotations derived directly from motion data. This process converts raw motion sequences into informative question-answer pairs, facilitating the effective use of large-scale unlabeled video data. Built upon whole-body control architectures, Humanoid-VLA is shown in extensive experiments to perform object interaction and environment exploration tasks with enhanced contextual awareness, demonstrating a more human-like capacity for adaptive and intelligent engagement.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_14795
institution	arXiv
publishDate	2025
record_format	arxiv
title	Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration
topic	Robotics; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2502.14795

Similar Items