MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Fu, Rao; Liu, Jingyu; Chen, Xilun; Nie, Yixin; Xiong, Wenhan
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2403.11401

_version_	1866929286460473344
author	Fu, Rao; Liu, Jingyu; Chen, Xilun; Nie, Yixin; Xiong, Wenhan
author_facet	Fu, Rao; Liu, Jingyu; Chen, Xilun; Nie, Yixin; Xiong, Wenhan
contents	This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_11401
institution	arXiv
publishDate	2024
record_format	arxiv
title	Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2403.11401
