Bibliographic Details
Main Authors: Ma, Zeliang; Yang, Song; Cui, Zhe; Zhao, Zhicheng; Su, Fei; Liu, Delong; Wang, Jingyu
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2404.12031
author Ma, Zeliang
Yang, Song
Cui, Zhe
Zhao, Zhicheng
Su, Fei
Liu, Delong
Wang, Jingyu
contents The new trend in multi-object tracking is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a new benchmark dataset, Refer-UE-City, which primarily includes scenes from intersection surveillance videos and details the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object tracking framework called MLS-Track, in which the interaction between the model and text is enhanced layer by layer through a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of the proposed framework, which achieves state-of-the-art performance. Code and datasets will be made available.
format Preprint
id arxiv_https___arxiv_org_abs_2404_12031
institution arXiv
publishDate 2024
record_format arxiv
title MLS-Track: Multilevel Semantic Interaction in RMOT
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2404.12031