Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Guo, Yaowei; Xing, Jiazheng; Hou, Xiaojun; Xin, Shuo; Jiang, Juntao; Terzopoulos, Demetri; Jiang, Chenfanfu; Liu, Yong
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.00364

_version_	1866910851913482240
author	Guo, Yaowei; Xing, Jiazheng; Hou, Xiaojun; Xin, Shuo; Jiang, Juntao; Terzopoulos, Demetri; Jiang, Chenfanfu; Liu, Yong
author_facet	Guo, Yaowei; Xing, Jiazheng; Hou, Xiaojun; Xin, Shuo; Jiang, Juntao; Terzopoulos, Demetri; Jiang, Chenfanfu; Liu, Yong
contents	Video summarization, by selecting the most informative and/or user-relevant parts of original videos to create concise summary videos, has high research value and consumer demand in today's video proliferation era. Multi-modal video summarization that accommodates user input has become a research hotspot. However, current multi-modal video summarization methods suffer from two limitations. First, existing methods inadequately fuse information from different modalities and cannot effectively utilize modality-unique features. Second, most multi-modal methods focus on the video and text modalities and neglect the audio modality, even though audio information can be very useful in certain types of videos. In this paper, we propose CFSum, a transformer-based multi-modal video summarization framework with coarse-fine fusion. CFSum takes video, text, and audio modal features as input and incorporates a two-stage transformer-based feature fusion framework to fully utilize modality-unique information. In the first stage, multi-modal features are fused simultaneously to perform initial coarse-grained feature fusion; in the second stage, video and audio features explicitly attend to the text representation, yielding more fine-grained information interaction. The CFSum architecture gives equal importance to each modality, ensuring that each modal feature interacts deeply with the other modalities. Our extensive comparative experiments against prior methods and ablation studies on various datasets confirm the effectiveness and superiority of CFSum.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_00364
institution	arXiv
publishDate	2025
record_format	arxiv
title	CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.00364
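
The two-stage fusion described in the contents field can be illustrated with a toy sketch. This is not the authors' implementation: the record gives no layer details, so the single-head attention without learned projections, the residual connections, and the names attend and coarse_fine_fusion are all assumptions made for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention with a residual connection.
    Assumption: no learned projections, purely illustrative."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(q))]
        out.append([x + c for x, c in zip(q, context)])
    return out

def coarse_fine_fusion(video, audio, text):
    """Hypothetical two-stage fusion over lists of d-dimensional tokens."""
    # Stage 1 (coarse): all modalities are concatenated and attend jointly.
    tokens = video + audio + text
    fused = attend(tokens, tokens, tokens)
    n_v, n_a = len(video), len(audio)
    v, a, t = fused[:n_v], fused[n_v:n_v + n_a], fused[n_v + n_a:]
    # Stage 2 (fine): video and audio tokens each attend to the text tokens.
    return attend(v, t, t), attend(a, t, t)

# Toy usage: 3 video tokens, 2 audio tokens, 2 text tokens, dimension 4.
video = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
audio = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
text = [[0.2, 0.2, 0.2, 0.2], [1.0, 1.0, 0.0, 0.0]]
video_out, audio_out = coarse_fine_fusion(video, audio, text)
print(len(video_out), len(audio_out))
```

In this sketch the coarse stage lets every token see every other token, while the fine stage restricts video and audio queries to text keys and values; a real model would add learned projections, multiple heads, and a scoring head on top.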
