| Main Authors: | Ma, Zeliang; Yang, Song; Cui, Zhe; Zhao, Zhicheng; Su, Fei; Liu, Delong; Wang, Jingyu |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2404.12031 |
| _version_ | 1866913320236220416 |
|---|---|
| author | Ma, Zeliang; Yang, Song; Cui, Zhe; Zhao, Zhicheng; Su, Fei; Liu, Delong; Wang, Jingyu |
| author_facet | Ma, Zeliang; Yang, Song; Cui, Zhe; Zhao, Zhicheng; Su, Fei; Liu, Delong; Wang, Jingyu |
| contents | A new trend in multi-object tracking is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object framework called MLS-Track, where the interaction between the model and text is enhanced layer by layer through the introduction of a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of the proposed framework, which achieves state-of-the-art performance. Code and datasets will be available. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2404_12031 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | MLS-Track: Multilevel Semantic Interaction in RMOT |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2404.12031 |