Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Etchegaray, Djamahl, Huang, Zi, Harada, Tatsuya, Luo, Yadan
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition Artificial Intelligence
Online Access:	https://arxiv.org/abs/2403.13556
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866911953463541760
author	Etchegaray, Djamahl Huang, Zi Harada, Tatsuya Luo, Yadan
author_facet	Etchegaray, Djamahl Huang, Zi Harada, Tatsuya Luo, Yadan
contents	In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high costs associated with annotating new object classes. Our exploration of open-vocabulary (OV) learning in urban environments aims to capture novel instances using pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark a set of four potential solutions as baselines, categorizing them into either top-down or bottom-up approaches based on their input data strategies. While effective, these methods exhibit certain limitations, such as missing novel objects in 3D box estimation or applying rigorous priors, leading to biases towards objects near the camera or of rectangular geometries. To overcome these limitations, we introduce a universal \textsc{Find n' Propagate} approach for 3D OV tasks, aimed at maximizing the recall of novel objects and propagating this detection capability to more distant areas thereby progressively capturing more. In particular, we utilize a greedy box seeker to search against 3D novel boxes of varying orientations and depth in each generated frustum and ensure the reliability of newly identified boxes by cross alignment and density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances in the self-training process, combined with the fusion of base samples in the memory bank. Extensive experiments demonstrate a 53% improvement in novel recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is made available at https://github.com/djamahl99/findnpropagate.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_13556
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments Etchegaray, Djamahl Huang, Zi Harada, Tatsuya Luo, Yadan Computer Vision and Pattern Recognition Artificial Intelligence In this work, we tackle the limitations of current LiDAR-based 3D object detection systems, which are hindered by a restricted class vocabulary and the high costs associated with annotating new object classes. Our exploration of open-vocabulary (OV) learning in urban environments aims to capture novel instances using pre-trained vision-language models (VLMs) with multi-sensor data. We design and benchmark a set of four potential solutions as baselines, categorizing them into either top-down or bottom-up approaches based on their input data strategies. While effective, these methods exhibit certain limitations, such as missing novel objects in 3D box estimation or applying rigorous priors, leading to biases towards objects near the camera or of rectangular geometries. To overcome these limitations, we introduce a universal \textsc{Find n' Propagate} approach for 3D OV tasks, aimed at maximizing the recall of novel objects and propagating this detection capability to more distant areas thereby progressively capturing more. In particular, we utilize a greedy box seeker to search against 3D novel boxes of varying orientations and depth in each generated frustum and ensure the reliability of newly identified boxes by cross alignment and density ranker. Additionally, the inherent bias towards camera-proximal objects is alleviated by the proposed remote simulator, which randomly diversifies pseudo-labeled novel instances in the self-training process, combined with the fusion of base samples in the memory bank. Extensive experiments demonstrate a 53% improvement in novel recall across diverse OV settings, VLMs, and 3D detectors. Notably, we achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes. The source code is made available at https://github.com/djamahl99/findnpropagate.
title	Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments
topic	Computer Vision and Pattern Recognition Artificial Intelligence
url	https://arxiv.org/abs/2403.13556

Similar Items