Staff View: Generative Region-Language Pretraining for Open-Ended Object Detection :: Library Catalog

Bibliographic Details
Main Authors:	Lin, Chuang; Jiang, Yi; Qu, Lizhen; Yuan, Zehuan; Cai, Jianfei
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2403.10191

_version_	1866909137922686976
author	Lin, Chuang; Jiang, Yi; Qu, Lizhen; Yuan, Zehuan; Cai, Jianfei
author_facet	Lin, Chuang; Jiang, Yi; Qu, Lizhen; Yuan, Zehuan; Cai, Jianfei
contents	In recent research, significant attention has been devoted to the open-vocabulary object detection task, aiming to generalize beyond the limited number of classes labeled during training and detect objects described by arbitrary category names at inference. Compared with conventional object detection, open-vocabulary object detection largely extends the object detection categories. However, it relies on calculating the similarity between image regions and a set of arbitrary category names with a pretrained vision-and-language model. This implies that, despite its open-set nature, the task still needs the predefined object categories during the inference stage. This raises the question: What if we do not have exact knowledge of object categories during inference? In this paper, we call this new setting generative open-ended object detection, which is a more general and practical problem. To address it, we formulate object detection as a generative problem and propose a simple framework named GenerateU, which can detect dense objects and generate their names in a free-form way. In particular, we employ Deformable DETR as a region proposal generator with a language model translating visual regions to object names. To assess the free-form object detection task, we introduce an evaluation method designed to quantitatively measure the performance of generative outcomes. Extensive experiments demonstrate the strong zero-shot detection performance of GenerateU. For example, on the LVIS dataset, GenerateU achieves results comparable to the open-vocabulary object detection method GLIP, even though the category names are not seen by GenerateU during inference. Code is available at: https://github.com/FoundationVision/GenerateU.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_10191
institution	arXiv
publishDate	2024
record_format	arxiv
title	Generative Region-Language Pretraining for Open-Ended Object Detection
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2403.10191
