Staff View: Structured Interfaces for Automated Reasoning with 3D Scene Graphs :: Library Catalog

Bibliographic Details
Main Authors:	Ray, Aaron; Arkin, Jacob; Biggie, Harel; Fan, Chuchu; Carlone, Luca; Roy, Nicholas
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Robotics; I.2.9; I.2.10; H.3.3
Online Access:	https://arxiv.org/abs/2510.16643

_version_	1866918163749273600
author	Ray, Aaron; Arkin, Jacob; Biggie, Harel; Fan, Chuchu; Carlone, Luca; Roy, Nicholas
author_facet	Ray, Aaron; Arkin, Jacob; Biggie, Harel; Fan, Chuchu; Carlone, Luca; Roy, Nicholas
contents	In order to provide a robot with the ability to understand and react to a user's natural language inputs, the natural language must be connected to the robot's underlying representations of the world. Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become a popular choice for grounding natural language and representing the world. In this work, we address the challenge of using LLMs with 3DSGs to ground natural language. Existing methods encode the scene graph as serialized text within the LLM's context window, but this encoding does not scale to large or rich 3DSGs. Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task. We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding. We evaluate our approach on instruction following and scene question-answering tasks and compare against baseline context window and code generation methods. Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs on both local and cloud-based models. This leads to large performance improvements in grounded language tasks while also substantially reducing the token count of the scene graph content. A video supplement is available at https://www.youtube.com/watch?v=zY_YI9giZSA.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_16643
institution	arXiv
publishDate	2025
record_format	arxiv
title	Structured Interfaces for Automated Reasoning with 3D Scene Graphs
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Robotics; I.2.9; I.2.10; H.3.3
url	https://arxiv.org/abs/2510.16643
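
The abstract describes encoding a 3D scene graph in a graph database and exposing Cypher to the LLM as a retrieval tool. The following is a minimal sketch of that idea, not the authors' implementation: the toy graph, the `Room`/`Object` labels, the `CONTAINS` edge type, and every function name are assumptions made for illustration. It only builds parameterized Cypher statements and does not connect to a database.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical toy scene graph; labels, properties, and edge types are
# illustrative assumptions and are not taken from the paper.
NODES = [
    {"label": "Room", "id": "room_1", "name": "kitchen"},
    {"label": "Room", "id": "room_2", "name": "office"},
    {"label": "Object", "id": "obj_1", "name": "mug"},
    {"label": "Object", "id": "obj_2", "name": "laptop"},
]
EDGES = [("room_1", "CONTAINS", "obj_1"), ("room_2", "CONTAINS", "obj_2")]

# Cypher cannot parameterize labels or relationship types, so they are
# interpolated into the query text only after a whitelist check.
ALLOWED_LABELS = {"Room", "Object"}
ALLOWED_EDGE_TYPES = {"CONTAINS"}

Statement = Tuple[str, Dict[str, Any]]


def node_statement(node: Dict[str, Any]) -> Statement:
    """Build a parameterized MERGE statement for one scene graph node."""
    label = node["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label}")
    props = {k: v for k, v in node.items() if k != "label"}
    query = f"MERGE (n:{label} {{id: $id}}) SET n += $props"
    return query, {"id": node["id"], "props": props}


def edge_statement(src: str, rel: str, dst: str) -> Statement:
    """Build a parameterized MERGE statement for one scene graph edge."""
    if rel not in ALLOWED_EDGE_TYPES:
        raise ValueError(f"unexpected edge type: {rel}")
    query = f"MATCH (a {{id: $src}}), (b {{id: $dst}}) MERGE (a)-[:{rel}]->(b)"
    return query, {"src": src, "dst": dst}


def load_statements(nodes, edges) -> List[Statement]:
    """Return node statements first, then edge statements."""
    stmts = [node_statement(n) for n in nodes]
    stmts += [edge_statement(s, r, d) for s, r, d in edges]
    return stmts


# Example of a query an LLM might issue through the tool: it retrieves only
# the objects in one room instead of reading a serialized copy of the graph.
RETRIEVAL_QUERY = (
    "MATCH (r:Room {name: $room})-[:CONTAINS]->(o:Object) "
    "RETURN o.id AS id, o.name AS name"
)

if __name__ == "__main__":
    for query, params in load_statements(NODES, EDGES):
        print(query, params)
    print(RETRIEVAL_QUERY, {"room": "kitchen"})
```

Each `(query, params)` pair could then be handed to a graph database client; with the official Neo4j Python driver, for instance, that would typically be a `session.run(query, params)` call. Which database and client the paper actually uses is not stated in this record.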

Similar Items