Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Pang, Mingxi; Wang, Dingheng; Li, Zekun; Sun, Zhenping; Wang, Bo; Wang, Zhihang; Yang, Zhao-Xu
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.17024

_version_	1866914486525362176
author	Pang, Mingxi; Wang, Dingheng; Li, Zekun; Sun, Zhenping; Wang, Bo; Wang, Zhihang; Yang, Zhao-Xu
author_facet	Pang, Mingxi; Wang, Dingheng; Li, Zekun; Sun, Zhenping; Wang, Bo; Wang, Zhihang; Yang, Zhao-Xu
contents	Query-based 3D object detection methods that use multi-view images often struggle to leverage dynamic multi-scale information efficiently; for example, the relationship between the object features and the geometry of the queries is not sufficiently learned, and directly exploring the multi-scale spatiotemporal features is too costly. To address these challenges, we propose CAM3DNet, a novel sparse query-based framework that combines three new modules: composite query (CQ), adaptive self-attention (ASA), and multi-scale hybrid sampling (MSHS). First, the core idea of the CQ module is a multi-scale projection strategy that transforms 2D queries into 3D space. Second, the ASA module learns the interactions between the spatiotemporal multi-scale queries. Third, the MSHS module uses the deformable attention mechanism to sample multi-scale object information by considering multi-scale queries, pyramid feature maps, and 2D-camera prior knowledge. The entire model employs a backbone network and a feature pyramid network (FPN) as the encoder, introduces YOLOX and a DepthNet as an ROI_Head to produce the CQ, and repeatedly applies ASA and MSHS as the decoder to obtain detection features. Extensive experiments on the nuScenes, Waymo, and Argoverse benchmark datasets demonstrate the effectiveness of CAM3DNet, which outperforms most existing camera-based 3D object detection methods. In addition, we conduct comprehensive ablation studies to assess the individual effects of CQ, ASA, and MSHS, as well as their space and computational costs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_17024
institution	arXiv
publishDate	2026
record_format	arxiv
title	CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.17024

Similar Items