Enregistré dans:
Détails bibliographiques
Auteurs principaux: Fu, Rao, Liu, Jingyu, Chen, Xilun, Nie, Yixin, Xiong, Wenhan
Format: Preprint
Publié: 2024
Sujets:
Accès en ligne:https://arxiv.org/abs/2403.11401
Tags: Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
_version_ 1866929286460473344
author Fu, Rao
Liu, Jingyu
Chen, Xilun
Nie, Yixin
Xiong, Wenhan
author_facet Fu, Rao
Liu, Jingyu
Chen, Xilun
Nie, Yixin
Xiong, Wenhan
contents This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation, that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features in the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
format Preprint
id arxiv_https___arxiv_org_abs_2403_11401
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning
Fu, Rao
Liu, Jingyu
Chen, Xilun
Nie, Yixin
Xiong, Wenhan
Computer Vision and Pattern Recognition
Artificial Intelligence
This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation, that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features in the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
title Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2403.11401