Staff View: DISCO: Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting :: Library Catalog

Bibliographic Details
Main Authors:	Hao, Ce; Lin, Kelvin; Xue, Zhiwei; Luo, Siyuan; Soh, Harold
Format:	Preprint
Published:	2024
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2406.09767

_version_	1866909742126858240
author	Hao, Ce; Lin, Kelvin; Xue, Zhiwei; Luo, Siyuan; Soh, Harold
author_facet	Hao, Ce; Lin, Kelvin; Xue, Zhiwei; Luo, Siyuan; Soh, Harold
contents	Diffusion policies have demonstrated strong performance in generative modeling, making them promising for robotic manipulation guided by natural language instructions. However, generalizing language-conditioned diffusion policies to open-vocabulary instructions in everyday scenarios remains challenging due to the scarcity and cost of robot demonstration datasets. To address this, we propose DISCO, a framework that leverages off-the-shelf vision-language models (VLMs) to bridge natural language understanding with high-performance diffusion policies. DISCO translates linguistic task descriptions into actionable 3D keyframes using VLMs, which then guide the diffusion process through constrained inpainting. However, enforcing strict adherence to these keyframes can degrade performance when the VLM-generated keyframes are inaccurate. To mitigate this, we introduce an inpainting optimization strategy that balances keyframe adherence with learned motion priors from training data. Experimental results in both simulated and real-world settings demonstrate that DISCO outperforms conventional fine-tuned language-conditioned policies, achieving superior generalization in zero-shot, open-vocabulary manipulation tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_09767
institution	arXiv
publishDate	2024
record_format	arxiv
title	DISCO: Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting
topic	Robotics
url	https://arxiv.org/abs/2406.09767
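
Illustrative note: the abstract in this record describes keyframes that guide the diffusion process through constrained inpainting, together with a strategy that balances keyframe adherence against learned motion priors. The following is a minimal toy sketch of that general idea only, not the authors' implementation. The function names, the neighbour-averaging "denoiser" (a stand-in for a learned diffusion policy), and the single blending weight are all assumptions introduced for illustration; the paper's actual optimization may differ substantially.

```python
import random


def toy_denoise_step(traj, strength=0.5):
    """Hypothetical stand-in for a learned denoiser: pull each interior
    waypoint toward the mean of its two neighbours (a crude motion prior)."""
    out = [list(p) for p in traj]
    for i in range(1, len(traj) - 1):
        for d in range(len(traj[i])):
            neighbour_mean = 0.5 * (traj[i - 1][d] + traj[i + 1][d])
            out[i][d] = (1 - strength) * traj[i][d] + strength * neighbour_mean
    return out


def inpaint_keyframes(traj, keyframes, weight=1.0):
    """Blend keyframe targets into the trajectory.

    weight=1.0 overwrites the waypoint (hard inpainting);
    0 < weight < 1 only nudges it, leaving room for the prior (assumed
    here as a simple way to tolerate inaccurate keyframes).
    """
    out = [list(p) for p in traj]
    for idx, target in keyframes.items():
        out[idx] = [(1 - weight) * x + weight * t
                    for x, t in zip(out[idx], target)]
    return out


def sample_trajectory(horizon, keyframes, steps=50, weight=1.0, seed=0):
    """Start from Gaussian noise, then alternate denoising and inpainting."""
    rng = random.Random(seed)
    traj = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(horizon)]
    for _ in range(steps):
        traj = toy_denoise_step(traj)
        traj = inpaint_keyframes(traj, keyframes, weight)
    return traj


if __name__ == "__main__":
    # Hypothetical 3D keyframes, as if proposed by a VLM: waypoint index -> xyz.
    keyframes = {0: [0.0, 0.0, 0.0], 4: [0.5, 0.0, 0.3], 9: [1.0, 0.2, 0.0]}
    hard = sample_trajectory(10, keyframes, weight=1.0)
    soft = sample_trajectory(10, keyframes, weight=0.5)
    print("hard keyframe 4:", hard[4])
    print("soft keyframe 4:", soft[4])
```

With `weight=1.0` the keyframe waypoints are reproduced exactly, which mirrors the strict adherence that the abstract says can hurt when keyframes are inaccurate. Lowering `weight` lets the toy prior pull those waypoints away from the targets, which is the trade-off this sketch is meant to make visible.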

Similar Items