Bibliographic Details
Main Authors: Ruan, Chi; Zhao, Jiying; Chen, Wenhu
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2502.20622
author Ruan, Chi
Zhao, Jiying
Chen, Wenhu
author_facet Ruan, Chi
Zhao, Jiying
Chen, Wenhu
contents Although open-vocabulary object detectors can generalize to unseen categories, they still rely on predefined textual prompts or classifier heads during inference. Recent generative object detectors address this limitation by coupling an autoregressive language model with a detector backbone, enabling direct category name generation for each detected object. However, this straightforward design introduces structural redundancy and substantial latency. In this paper, we propose a Real-Time Generative Detection Transformer (RTGen), a real-time generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region-Language Decoder (RL-Decoder) that jointly decodes visual and textual representations within a unified framework. The textual side is organized as a Directed Acyclic Graph (DAG), enabling non-autoregressive category naming. Benefiting from these designs, RTGen-R34 achieves 131.3 FPS on T4 GPUs, over 270x faster than GenerateU. Moreover, our models learn to generate category names directly from detection labels, without relying on external supervision such as CLIP or pretrained language models, achieving efficient and flexible open-ended detection.
format Preprint
id arxiv_https___arxiv_org_abs_2502_20622
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle RTGen: Real-Time Generative Detection Transformer
Ruan, Chi
Zhao, Jiying
Chen, Wenhu
Computer Vision and Pattern Recognition
title RTGen: Real-Time Generative Detection Transformer
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2502.20622