Bibliographic Details
Main Authors: Lin, Chuang; Jiang, Yi; Qu, Lizhen; Yuan, Zehuan; Cai, Jianfei
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2403.10191
_version_ 1866909137922686976
author Lin, Chuang
Jiang, Yi
Qu, Lizhen
Yuan, Zehuan
Cai, Jianfei
contents Recent research has devoted significant attention to open-vocabulary object detection, which aims to generalize beyond the limited number of classes labeled during training and to detect objects described by arbitrary category names at inference. Compared with conventional object detection, open-vocabulary object detection greatly extends the set of detectable categories. However, it relies on calculating the similarity between image regions and a set of arbitrary category names with a pretrained vision-and-language model. This implies that, despite its open-set nature, the task still requires predefined object categories at the inference stage. This raises the question: what if we do not have exact knowledge of object categories during inference? In this paper, we call this new setting generative open-ended object detection, which is a more general and practical problem. To address it, we formulate object detection as a generative problem and propose a simple framework named GenerateU, which can detect dense objects and generate their names in a free-form way. Specifically, we employ Deformable DETR as a region proposal generator, with a language model translating visual regions to object names. To assess the free-form object detection task, we introduce an evaluation method designed to quantitatively measure the performance of generative outcomes. Extensive experiments demonstrate the strong zero-shot detection performance of GenerateU. For example, on the LVIS dataset, GenerateU achieves results comparable to the open-vocabulary object detection method GLIP, even though the category names are not seen by GenerateU during inference. Code is available at: https://github.com/FoundationVision/GenerateU.
format Preprint
id arxiv_https___arxiv_org_abs_2403_10191
institution arXiv
publishDate 2024
record_format arxiv
title Generative Region-Language Pretraining for Open-Ended Object Detection
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2403.10191