Bibliographic Details
Main Authors: Linok, Sergey; Zemskova, Tatiana; Ladanova, Svetlana; Titkov, Roman; Yudin, Dmitry; Monastyrny, Maxim; Valenkov, Aleksei
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2406.07113
author Linok, Sergey
Zemskova, Tatiana
Ladanova, Svetlana
Titkov, Roman
Yudin, Dmitry
Monastyrny, Maxim
Valenkov, Aleksei
contents Locating objects described in natural language is a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs a 3D scene graph representation with metric and semantic spatial edges and uses a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct a 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe the mapped objects as graph nodes. On the Replica and ScanNet datasets, we demonstrate that BBQ ranks among the leading zero-shot methods in open-vocabulary 3D semantic segmentation. We also show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On the challenging Sr3D+, Nr3D, and ScanRefer benchmarks, our deductive approach demonstrates a significant improvement over other state-of-the-art methods, enabling object grounding from complex queries. The combination of our design choices and software implementation results in high data processing speed in experiments on the robot's on-board computer. This promising performance enables the application of our approach in intelligent robotics projects. The code is publicly available at https://linukc.github.io/BeyondBareQueries/.
format Preprint
id arxiv_https___arxiv_org_abs_2406_07113
institution arXiv
publishDate 2024
record_format arxiv
title Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2406.07113