Staff View: Referring to Any Person :: Library Catalog

Bibliographic Details
Main Authors:	Jiang, Qing; Wu, Lin; Zeng, Zhaoyang; Ren, Tianhe; Xiong, Yuda; Chen, Yihao; Liu, Qin; Zhang, Lei
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.08507

_version_	1866909606904594432
author	Jiang, Qing; Wu, Lin; Zeng, Zhaoyang; Ren, Tianhe; Xiong, Yuda; Chen, Yihao; Liu, Qin; Zhang, Lei
author_facet	Jiang, Qing; Wu, Lin; Zeng, Zhaoyang; Ren, Tianhe; Xiong, Yuda; Chen, Yihao; Liu, Qin; Zhang, Lei
contents	Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, which hinders progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at https://github.com/IDEA-Research/RexSeek
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_08507
institution	arXiv
publishDate	2025
record_format	arxiv
title	Referring to Any Person
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.08507
