Staff View: MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding :: Library Catalog

Bibliographic Details
Main Authors:	Ran, Ran; Wei, Jiwei; Zhou, Shuchang; Qin, Yitong; He, Shiyuan; Ma, Zeyu; Zhou, Yuyang; Yang, Yang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.03398

_version_	1866909013616099328
author	Ran, Ran; Wei, Jiwei; Zhou, Shuchang; Qin, Yitong; He, Shiyuan; Ma, Zeyu; Zhou, Yuyang; Yang, Yang
author_facet	Ran, Ran; Wei, Jiwei; Zhou, Shuchang; Qin, Yitong; He, Shiyuan; Ma, Zeyu; Zhou, Yuyang; Yang, Yang
contents	Video Temporal Grounding (VTG) faces a cross-modal semantic gap that often leads to background features being incorrectly aligned with the query, while directly matching the query to moments results in insufficient discriminability and consistency of temporal semantics. To address this issue, we propose MLLM-Assisted Semantic-Relational Consistent Alignment (MASRA), a training-time MLLM-based optimization framework for VTG. MASRA leverages an MLLM during training to produce two forms of textual priors, namely event-level descriptions with temporal spans and clip-level captions, and instantiates two MLLM-assisted alignments. Event Semantic Temporal Alignment (ESTA) aligns temporal context with event semantics to explicitly strengthen the correspondence between semantics and temporal events and improve span-level separability. Local Relational Consistency Alignment (LRCA) constructs a textual relation matrix derived from clip-level captions and aligns it with the temporal feature similarity matrix in the model, enhancing temporal consistency while capturing local structural information. MASRA includes two simple supporting modules, semantic-guided enhancement and second-order relational attention, to better utilize the learned semantic context and relational structure. Moreover, we introduce Decoupled Alignment Interaction (DAI) with a context-aware codebook to adaptively absorb query-irrelevant semantics and alleviate the cross-modal gap. The MLLM is only invoked during training and is not used at inference. Extensive experiments show that MASRA outperforms existing methods, and ablation studies validate its effectiveness.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_03398
institution	arXiv
publishDate	2026
record_format	arxiv
title	MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.03398
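
Illustrative note: the abstract in the contents field describes LRCA as aligning a textual relation matrix, derived from clip-level captions, with the model's temporal feature similarity matrix. The record does not give the formulation, so the following is only a minimal sketch under stated assumptions: both matrices are taken to be pairwise cosine similarities, and the alignment objective is taken to be a mean squared difference. The function names `cosine_similarity_matrix` and `relational_alignment_loss` are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine-similarity matrix for a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            denom = norms[i] * norms[j]
            # Zero vectors get similarity 0.0 to avoid division by zero.
            sim[i][j] = dot / denom if denom > 0 else 0.0
    return sim


def relational_alignment_loss(caption_embeddings, clip_features):
    """Mean squared difference between two relation matrices.

    Assumption: the textual relation matrix and the temporal feature
    similarity matrix are both cosine-similarity matrices, and alignment
    is measured by mean squared error. The paper may define these
    differently.
    """
    text_rel = cosine_similarity_matrix(caption_embeddings)
    feat_sim = cosine_similarity_matrix(clip_features)
    n = len(text_rel)
    total = sum(
        (text_rel[i][j] - feat_sim[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)
```

Under these assumptions the loss is zero when the clip features reproduce the pairwise structure of the caption embeddings, and it grows as clips that the captions describe differently collapse onto similar features.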

Similar Items