Saved in:
| Main Authors: | , , , , , |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.00506 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866909668572397568 |
|---|---|
| author | Xie, Yunfei Cheng, Yuxuan Wu, Juncheng Zhang, Haoyu Zhou, Yuyin Han, Shoudong |
| author_facet | Xie, Yunfei Cheng, Yuxuan Wu, Juncheng Zhang, Haoyu Zhou, Yuyin Han, Shoudong |
| contents | Recent advancements in adapting vision-language pre-training models like CLIP for person re-identification (ReID) tasks often rely on complex adapter design or modality-specific tuning while neglecting cross-modal interaction, leading to high computational costs or suboptimal alignment. To address these limitations, we propose a simple yet effective framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-world perturbations. Our method introduces two key innovations: Firstly, we proposed Selective Visual Prompt Fusion (SVIP), a lightweight module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism. Moreover, the proposed Perturbation-Driven Consistency Alignment (PDCA) is a dual-path training strategy that enforces invariant feature alignment under random image perturbations by regularizing consistency between original and augmented cross-modal embeddings. Extensive experiments are conducted on several popular benchmarks covering Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-REID, and P-DukeMTMC, which demonstrate the impressive performance of the proposed method. Notably, our framework eliminates heavy adapters while maintaining efficient inference, achieving an optimal trade-off between performance and computational overhead. The code will be released upon acceptance. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2507_00506 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | SCING:Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning Xie, Yunfei Cheng, Yuxuan Wu, Juncheng Zhang, Haoyu Zhou, Yuyin Han, Shoudong Computer Vision and Pattern Recognition Recent advancements in adapting vision-language pre-training models like CLIP for person re-identification (ReID) tasks often rely on complex adapter design or modality-specific tuning while neglecting cross-modal interaction, leading to high computational costs or suboptimal alignment. To address these limitations, we propose a simple yet effective framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-world perturbations. Our method introduces two key innovations: Firstly, we proposed Selective Visual Prompt Fusion (SVIP), a lightweight module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism. Moreover, the proposed Perturbation-Driven Consistency Alignment (PDCA) is a dual-path training strategy that enforces invariant feature alignment under random image perturbations by regularizing consistency between original and augmented cross-modal embeddings. Extensive experiments are conducted on several popular benchmarks covering Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-REID, and P-DukeMTMC, which demonstrate the impressive performance of the proposed method. Notably, our framework eliminates heavy adapters while maintaining efficient inference, achieving an optimal trade-off between performance and computational overhead. The code will be released upon acceptance. |
| title | SCING:Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2507.00506 |