Staff View: Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph :: Library Catalog

Bibliographic Details
Main Authors:	Linok, Sergey; Zemskova, Tatiana; Ladanova, Svetlana; Titkov, Roman; Yudin, Dmitry; Monastyrny, Maxim; Valenkov, Aleksei
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2406.07113

_version_	1866908350895095808
author	Linok, Sergey; Zemskova, Tatiana; Ladanova, Svetlana; Titkov, Roman; Yudin, Dmitry; Monastyrny, Maxim; Valenkov, Aleksei
author_facet	Linok, Sergey; Zemskova, Tatiana; Ladanova, Svetlana; Titkov, Roman; Yudin, Dmitry; Monastyrny, Maxim; Valenkov, Aleksei
contents	Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries, but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs a 3D scene graph representation with metric and semantic spatial edges and uses a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct a 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe the mapped objects as graph nodes. On the Replica and ScanNet datasets, we demonstrate that BBQ takes a leading place in open-vocabulary 3D semantic segmentation compared to other zero-shot methods. We also show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On the challenging Sr3D+, Nr3D and ScanRefer benchmarks, our deductive approach demonstrates a significant improvement over other state-of-the-art methods, enabling object grounding with complex queries. The combination of our design choices and software implementation results in high data processing speed in experiments on the robot's on-board computer. This promising performance enables the application of our approach in intelligent robotics projects. We have made the code publicly available at https://linukc.github.io/BeyondBareQueries/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_07113
institution	arXiv
publishDate	2024
record_format	arxiv
title	Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2406.07113
