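Staff View: Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model :: Library Catalog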

Bibliographic Details
Main Authors:	Le, Long; Xie, Jason; Liang, William; Wang, Hung-Ju; Yang, Yue; Ma, Yecheng Jason; Vedder, Kyle; Krishna, Arjun; Jayaraman, Dinesh; Eaton, Eric
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2410.13882

_version_	1866913869143736320
author	Le, Long; Xie, Jason; Liang, William; Wang, Hung-Ju; Yang, Yue; Ma, Yecheng Jason; Vedder, Kyle; Krishna, Arjun; Jayaraman, Dinesh; Eaton, Eric
author_facet	Le, Long; Xie, Jason; Liang, William; Wang, Hung-Ju; Yang, Yue; Ma, Yecheng Jason; Vedder, Kyle; Krishna, Arjun; Jayaraman, Dinesh; Eaton, Eric
contents	Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation. However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we present Articulate-Anything, a system that automates the articulation of diverse, complex objects from many input modalities, including text, images, and videos. Articulate-Anything leverages vision-language models (VLMs) to generate code that can be compiled into an interactable digital twin for use in standard 3D simulators. Our system exploits existing 3D asset datasets via a mesh retrieval mechanism, along with an actor-critic system that iteratively proposes, evaluates, and refines solutions for articulating the objects, self-correcting errors to achieve a robust outcome. Qualitative evaluations demonstrate Articulate-Anything's capability to articulate complex and even ambiguous object affordances by leveraging rich grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility dataset, Articulate-Anything substantially outperforms prior work, increasing the success rate from 8.7-11.6% to 75% and setting a new bar for state-of-the-art performance. We further showcase the utility of our system by generating 3D assets from in-the-wild video inputs, which are then used to train robotic policies for fine-grained manipulation tasks in simulation that go beyond basic pick and place. These policies are then transferred to a real robotic system.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_13882
institution	arXiv
publishDate	2024
record_format	arxiv
title	Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2410.13882

Similar Items