Bibliographic Details
Main Authors: Li, Caixiong, Zhao, Xiongwei, Zhang, Jinhang, Zhang, Xing, Sun, Qihao, Wu, Zhou
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2502.16486
_version_ 1866915173305942016
author Li, Caixiong
Zhao, Xiongwei
Zhang, Jinhang
Zhang, Xing
Sun, Qihao
Wu, Zhou
contents Open-vocabulary detection (OVD) is the challenging task of detecting and classifying objects from an unrestricted set of categories, including those unseen during training. Existing open-vocabulary detectors are limited by complex visual-textual misalignment and long-tailed category imbalances, leading to suboptimal performance in challenging scenarios. To address these limitations, we introduce MQADet, a universal paradigm for enhancing existing open-vocabulary detectors by leveraging the cross-modal reasoning capabilities of multimodal large language models (MLLMs). MQADet functions as a plug-and-play solution that integrates seamlessly with pre-trained object detectors without substantial additional training costs. Specifically, we design a novel three-stage Multimodal Question Answering (MQA) pipeline that guides the MLLMs to precisely localize complex textual and visual targets while sharpening the focus of existing object detectors on relevant objects. To validate our approach, we present a new benchmark for evaluating our paradigm on four challenging open-vocabulary datasets, employing three state-of-the-art object detectors as baselines. Experimental results demonstrate that our proposed paradigm significantly improves the performance of existing detectors, particularly on unseen, complex categories, across diverse and challenging scenarios. To facilitate future research, we will publicly release our code.
format Preprint
id arxiv_https___arxiv_org_abs_2502_16486
institution arXiv
publishDate 2025
record_format arxiv
title MQADet: A Plug-and-Play Paradigm for Enhancing Open-Vocabulary Object Detection via Multimodal Question Answering
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2502.16486