Staff View: MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding :: Library Catalog

Bibliographic Details
Main Authors:	Chang, Chun-Peng; Wang, Shaoxiang; Pagani, Alain; Stricker, Didier
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2403.03077

_version_	1866915041321680896
author	Chang, Chun-Peng; Wang, Shaoxiang; Pagani, Alain; Stricker, Didier
author_facet	Chang, Chun-Peng; Wang, Shaoxiang; Pagani, Alain; Stricker, Didier
contents	3D visual grounding involves matching natural language descriptions to their corresponding objects in 3D space. Existing methods often suffer from inaccurate object recognition and struggle to interpret complex linguistic queries, particularly descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, improving both object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, which facilitates error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge on both the Sr3D and Nr3D datasets, with a particularly large margin in categories that require viewpoint-dependent descriptions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_03077
institution	arXiv
publishDate	2024
record_format	arxiv
title	MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2403.03077