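Staff View: Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning :: Library Catalog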

Bibliographic Details
Main Authors:	Li, Jiachen; Gao, Qiaozi; Johnston, Michael; Gao, Xiaofeng; He, Xuehai; Shakiah, Suhaila; Shi, Hangjie; Ghanadan, Reza; Wang, William Yang
Format:	Preprint
Published:	2023
Subjects:	Robotics; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2310.09676

_version_	1866911889431199744
author	Li, Jiachen; Gao, Qiaozi; Johnston, Michael; Gao, Xiaofeng; He, Xuehai; Shakiah, Suhaila; Shi, Hangjie; Ghanadan, Reza; Wang, William Yang
author_facet	Li, Jiachen; Gao, Qiaozi; Johnston, Michael; Gao, Xiaofeng; He, Xuehai; Shakiah, Suhaila; Shi, Hangjie; Ghanadan, Reza; Wang, William Yang
contents	Prompt-based learning has been demonstrated as a compelling paradigm contributing to the tremendous success of large language models (LLMs). Inspired by their success in language tasks, existing research has leveraged LLMs in embodied instruction following and task planning. In this work, we tackle the problem of training a robot to understand multimodal prompts, interleaving vision signals with text descriptions. This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals. We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts from multi-task expert trajectories. Our method consists of a two-stage training pipeline that performs inverse dynamics pretraining and multi-task finetuning. To facilitate multimodal understanding, we design our multimodal prompt encoder by augmenting a pretrained LM with a residual connection to the visual input, and we model the dependencies among action dimensions. Empirically, we evaluate the efficacy of our method on VIMA-BENCH and establish a new state of the art (10% improvement in success rate). Moreover, we demonstrate that our model exhibits remarkable in-context learning ability. Project page: https://midas-icml.github.io/
format	Preprint
id	arxiv_https___arxiv_org_abs_2310_09676
institution	arXiv
publishDate	2023
record_format	arxiv
title	Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning
topic	Robotics; Artificial Intelligence
url	https://arxiv.org/abs/2310.09676

Similar Items