Staff View: Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding :: Library Catalog

Bibliographic Details
Main Authors:	Li, Yuzhen; Liu, Min; Bian, Yuan; Wang, Xueping; Li, Zhaoyang; Li, Gen; Wang, Yaonan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Information systems~Multimedia and multimodal retrieval
Online Access:	https://arxiv.org/abs/2508.19165

_version_	1866908504649891840
author	Li, Yuzhen; Liu, Min; Bian, Yuan; Wang, Xueping; Li, Zhaoyang; Li, Gen; Wang, Yaonan
author_facet	Li, Yuzhen; Liu, Min; Bian, Yuan; Wang, Xueping; Li, Zhaoyang; Li, Gen; Wang, Yaonan
contents	Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply converting a length expressed in meters to the equivalent value in decimeters or centimeters leads to severe performance degradation, even though the physical length is unchanged. This observation indicates that pre-trained language models have a weak grasp of 3D quantities and generate misleading text features that hinder 3D perception. We therefore propose to enhance the model's 3D perception of both text embeddings and geometry features with two simple and effective methods. First, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which improves comprehension of the mapping between different units by increasing the diversity of distance descriptors in text queries. Second, we propose a Text-Guided Geometry Enhancement (TGE) module that further enhances the 3D-text information by projecting the basic text features into a geometrically consistent space. These 3D-enhanced text features are then used to precisely guide the attention of the geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the "Far" scenario. Our code will be made publicly available.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_19165
institution	arXiv
publishDate	2025
record_format	arxiv
title	Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding
topic	Computer Vision and Pattern Recognition; Information systems~Multimedia and multimodal retrieval
url	https://arxiv.org/abs/2508.19165
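
The abstract in the contents field describes augmenting text queries so that the same physical length appears under different units. The sketch below illustrates that general idea only; it is not the authors' 3DTE implementation, which this record does not describe in detail. The names UNIT_FACTORS, rewrite_units, and augment_query are hypothetical, and the sketch assumes that queries state lengths in meters.

```python
import random
import re

# Hypothetical conversion table: multiply a value in meters by this factor.
UNIT_FACTORS = {"meters": 1, "decimeters": 10, "centimeters": 100}

# Matches a number followed by "meter" or "meters", e.g. "12.5 meters".
_METER_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*meters?\b")


def rewrite_units(query: str, unit: str) -> str:
    """Rewrite every length given in meters into `unit`.

    The physical length is preserved; only the number and unit word change.
    """
    factor = UNIT_FACTORS[unit]

    def _convert(match: re.Match) -> str:
        value = float(match.group(1)) * factor
        # The "g" format drops a trailing ".0", so 1250.0 prints as "1250".
        return f"{value:g} {unit}"

    return _METER_PATTERN.sub(_convert, query)


def augment_query(query: str, rng: random.Random) -> str:
    """Pick one unit at random and rewrite the query with it (assumed policy)."""
    unit = rng.choice(sorted(UNIT_FACTORS))
    return rewrite_units(query, unit)


if __name__ == "__main__":
    text = "the car about 12.5 meters away"
    for unit in UNIT_FACTORS:
        print(rewrite_units(text, unit))
```

Sampling a unit per query, as augment_query does, is one possible augmentation policy; the abstract states only that the diversity of distance descriptors is increased.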
