Bibliographic Details
Main Authors: Chang, Chun-Peng, Wang, Shaoxiang, Pagani, Alain, Stricker, Didier
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2403.03077
author Chang, Chun-Peng
Wang, Shaoxiang
Pagani, Alain
Stricker, Didier
contents 3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often face challenges with object recognition accuracy and struggle to interpret complex linguistic queries, particularly descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, facilitating error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets, excelling by a particularly large margin in categories that require viewpoint-dependent descriptions.
format Preprint
id arxiv_https___arxiv_org_abs_2403_03077
institution arXiv
publishDate 2024
record_format arxiv
title MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2403.03077