Staff View: EPIPTrack: Rethinking Prompt Modeling with Explicit and Implicit Prompts for Multi-Object Tracking :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Yukuan; Zhao, Jiarui; Nie, Shangqing; Kuang, Jin; Wang, Shengsheng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2510.13235

_version_	1866915555696443392
author	Zhang, Yukuan; Zhao, Jiarui; Nie, Shangqing; Kuang, Jin; Wang, Shengsheng
author_facet	Zhang, Yukuan; Zhao, Jiarui; Nie, Shangqing; Kuang, Jin; Wang, Shengsheng
contents	Multimodal semantic cues, such as textual descriptions, have shown strong potential in enhancing target perception for tracking. However, existing methods rely on static textual descriptions from large language models, which lack adaptability to real-time target state changes and are prone to hallucinations. To address these challenges, we propose a unified multimodal vision-language tracking framework, named EPIPTrack, which leverages explicit and implicit prompts for dynamic target modeling and semantic alignment. Specifically, explicit prompts transform spatial motion information into natural language descriptions to provide spatiotemporal guidance. Implicit prompts combine pseudo-words with learnable descriptors to construct individualized knowledge representations capturing appearance attributes. Both prompts undergo dynamic adjustment via the CLIP text encoder to respond to changes in target state. Furthermore, we design a Discriminative Feature Augmentor to enhance visual and cross-modal representations. Extensive experiments on MOT17, MOT20, and DanceTrack demonstrate that EPIPTrack outperforms existing trackers in diverse scenarios, exhibiting robust adaptability and superior performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_13235
institution	arXiv
publishDate	2025
record_format	arxiv
title	EPIPTrack: Rethinking Prompt Modeling with Explicit and Implicit Prompts for Multi-Object Tracking
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2510.13235
