Staff View: RTGen: Real-Time Generative Detection Transformer :: Library Catalog

Bibliographic Details
Main Authors:	Ruan, Chi; Zhao, Jiying; Chen, Wenhu
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.20622

_version_	1866908653891616768
author	Ruan, Chi; Zhao, Jiying; Chen, Wenhu
author_facet	Ruan, Chi; Zhao, Jiying; Chen, Wenhu
contents	Although open-vocabulary object detectors can generalize to unseen categories, they still rely on predefined textual prompts or classifier heads during inference. Recent generative object detectors address this limitation by coupling an autoregressive language model with a detector backbone, enabling direct category name generation for each detected object. However, this straightforward design introduces structural redundancy and substantial latency. In this paper, we propose a Real-Time Generative Detection Transformer (RTGen), a real-time generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region-Language Decoder (RL-Decoder) that jointly decodes visual and textual representations within a unified framework. The textual side is organized as a Directed Acyclic Graph (DAG), enabling non-autoregressive category naming. Benefiting from these designs, RTGen-R34 achieves 131.3 FPS on T4 GPUs, over 270x faster than GenerateU. Moreover, our models learn to generate category names directly from detection labels, without relying on external supervision such as CLIP or pretrained language models, achieving efficient and flexible open-ended detection.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_20622
institution	arXiv
publishDate	2025
record_format	arxiv
title	RTGen: Real-Time Generative Detection Transformer
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2502.20622

Similar Items