Bibliographic Details
Main Authors: Li, Yi, Deng, Yuquan, Zhang, Jesse, Jang, Joel, Memmel, Marius, Yu, Raymond, Garrett, Caelan Reed, Ramos, Fabio, Fox, Dieter, Li, Anqi, Gupta, Abhishek, Goyal, Ankit
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2502.05485
author Li, Yi
Deng, Yuquan
Zhang, Jesse
Jang, Joel
Memmel, Marius
Yu, Raymond
Garrett, Caelan Reed
Ramos, Fabio
Fox, Dieter
Li, Anqi
Gupta, Abhishek
Goyal, Ankit
contents Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is the lack of robotic data, which are typically obtained through expensive on-robot operation. A promising remedy is to leverage cheaper, off-domain data such as action-free videos, hand-drawn sketches, or simulation data. In this work, we posit that hierarchical vision-language-action (VLA) models can be more effective in utilizing off-domain data than standard monolithic VLA models that directly finetune vision-language models (VLMs) to predict actions. In particular, we study a class of hierarchical VLA models in which the high-level VLM is finetuned to produce a coarse 2D path indicating the desired robot end-effector trajectory given an RGB image and a task description. The intermediate 2D path prediction then serves as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Doing so relieves the high-level VLM of fine-grained action prediction, while reducing the low-level policy's burden of complex task-level reasoning. We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios, including differences in embodiments, dynamics, visual appearances, and task semantics. In the real-robot experiments, we observe an average 20% improvement in success rate over OpenVLA across seven different axes of generalization, representing a 50% relative gain. Visual results, code, and dataset are provided at: https://hamster-robot.github.io/
format Preprint
id arxiv_https___arxiv_org_abs_2502_05485
institution arXiv
publishDate 2025
record_format arxiv
title HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation
topic Robotics
Artificial Intelligence
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2502.05485
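The two-level design described in the abstract (a high-level model that emits a coarse 2D path, and a low-level policy that follows it) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the HAMSTER code: the names `Observation`, `predict_path`, `step_toward`, and `run_episode` are hypothetical, the path is a fixed stub rather than a VLM output, and the toy controller moves in 2D only, whereas the paper's low-level policy is 3D-aware.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumed convention: (x, y) in normalized image coordinates.
Point2D = Tuple[float, float]


@dataclass
class Observation:
    rgb_image: Sequence[Sequence[Tuple[int, int, int]]]  # toy H x W grid of RGB pixels
    task: str  # natural-language task description


def predict_path(obs: Observation) -> List[Point2D]:
    """Hypothetical stand-in for the fine-tuned high-level VLM.

    A real model would condition on obs.rgb_image and obs.task; this stub
    returns a fixed coarse path so the sketch stays runnable.
    """
    return [(0.2, 0.8), (0.5, 0.4), (0.8, 0.6)]


def step_toward(position: Point2D, target: Point2D, max_step: float) -> Point2D:
    """Toy low-level controller: move at most max_step toward the target."""
    dx, dy = target[0] - position[0], target[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return target  # close enough to land exactly on the waypoint
    scale = max_step / dist
    return (position[0] + dx * scale, position[1] + dy * scale)


def run_episode(obs: Observation, start: Point2D, max_step: float = 0.1) -> List[Point2D]:
    """Query the high level once, then let the low level track each waypoint."""
    path = predict_path(obs)
    position = start
    visited = [position]
    for waypoint in path:
        while position != waypoint:
            position = step_toward(position, waypoint, max_step)
            visited.append(position)
    return visited
```

Calling `run_episode(Observation(rgb_image=[[(0, 0, 0)]], task="pick up the cup"), start=(0.0, 0.0))` returns the sequence of positions visited while tracking the stub path. The point of the split is that `predict_path` never needs to know about step sizes or control details, and `step_toward` never needs to interpret the task description.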