Staff View: MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval :: Library Catalog

Bibliographic Details
Main Authors:	Ying, Xinru; Mo, Jiaqi; Lin, Jingyang; Jin, Canghong; Wang, Fangfang; Wei, Lina
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2506.03473

_version_	1866915324018819072
author	Ying, Xinru; Mo, Jiaqi; Lin, Jingyang; Jin, Canghong; Wang, Fangfang; Wei, Lina
author_facet	Ying, Xinru; Mo, Jiaqi; Lin, Jingyang; Jin, Canghong; Wang, Fangfang; Wei, Lina
contents	Partially Relevant Video Retrieval (PRVR) is a challenging task in multimedia retrieval: given a text query, it aims to identify and retrieve untrimmed videos that are only partially relevant to that query. In this work, we investigate long-sequence video content understanding to address information redundancy. Leveraging the long-term state space modeling capability and linear scalability of the Mamba module, we introduce MamFusion, a multi-Mamba framework with temporal fusion tailored for the PRVR task. The framework captures state-relatedness in long-term video content and integrates it into text-video relevance understanding, thereby enhancing the retrieval process. Specifically, we introduce Temporal T-to-V Fusion and Temporal V-to-T Fusion to explicitly model temporal relationships between text queries and video moments, improving contextual awareness and retrieval accuracy. Extensive experiments on large-scale datasets demonstrate that MamFusion achieves state-of-the-art retrieval performance. Code is available at https://github.com/Vision-Multimodal-Lab-HZCU/MamFusion.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_03473
institution	arXiv
publishDate	2025
record_format	arxiv
title	MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2506.03473
