Vista Equipo: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autores principales:	Zhu, Ziyu, Wang, Xilin, Li, Yixuan, Zhang, Zhuofan, Ma, Xiaojian, Chen, Yixin, Jia, Baoxiong, Liang, Wei, Yu, Qian, Deng, Zhidong, Huang, Siyuan, Li, Qing
Formato:	Preprint
Publicado:	2025
Materias:	Computer Vision and Pattern Recognition
Acceso en línea:	https://arxiv.org/abs/2507.04047
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

_version_	1866913965611679744
author	Zhu, Ziyu Wang, Xilin Li, Yixuan Zhang, Zhuofan Ma, Xiaojian Chen, Yixin Jia, Baoxiong Liang, Wei Yu, Qian Deng, Zhidong Huang, Siyuan Li, Qing
author_facet	Zhu, Ziyu Wang, Xilin Li, Yixuan Zhang, Zhuofan Ma, Xiaojian Chen, Yixin Jia, Baoxiong Liang, Wei Yu, Qian Deng, Zhidong Huang, Siyuan Li, Qing
contents	Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce \underline{\textbf{M}}ove \underline{\textbf{t}}o \underline{\textbf{U}}nderstand (\textbf{\model}), a unified framework that integrates active perception with \underline{\textbf{3D}} vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploring, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning that combines \textbf{V}ision-\textbf{L}anguage-\textbf{E}xploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14\%, 23\%, 9\%, and 2\% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. \model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images. These findings highlight the importance of bridging visual grounding and exploration for embodied intelligence.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_04047
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation Zhu, Ziyu Wang, Xilin Li, Yixuan Zhang, Zhuofan Ma, Xiaojian Chen, Yixin Jia, Baoxiong Liang, Wei Yu, Qian Deng, Zhidong Huang, Siyuan Li, Qing Computer Vision and Pattern Recognition Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce \underline{\textbf{M}}ove \underline{\textbf{t}}o \underline{\textbf{U}}nderstand (\textbf{\model}), a unified framework that integrates active perception with \underline{\textbf{3D}} vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploring, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning that combines \textbf{V}ision-\textbf{L}anguage-\textbf{E}xploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14\%, 23\%, 9\%, and 2\% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. \model's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images. These findings highlight the importance of bridging visual grounding and exploration for embodied intelligence.
title	Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2507.04047

Ejemplares similares