Bibliographic Details
Main Authors: Jin, Peng; Li, Hao; Yuan, Li; Yan, Shuicheng; Chen, Jie
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2412.20964
author Jin, Peng
Li, Hao
Yuan, Li
Yan, Shuicheng
Chen, Jie
contents Multimodal representation learning based on contrastive learning plays an important role in artificial intelligence. As an important subfield, video-language representation learning focuses on learning representations from global semantic interactions between pre-defined video-text pairs. However, refining such coarse-grained global interactions requires more detailed interactions for fine-grained multimodal learning. In this study, we introduce a new approach that models video and text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to mitigate the bias in calculations within Banzhaf Interaction, we propose reconstructing the representation through a fusion of single-modal and cross-modal components. This reconstructed representation ensures fine granularity comparable to that of the single-modal representation, while also preserving the adaptive encoding characteristics of the cross-modal representation. Additionally, we extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used text-video retrieval, video question answering, and video captioning benchmarks show superior performance, validating the effectiveness and generalization of our method.
format Preprint
id arxiv_https___arxiv_org_abs_2412_20964
institution arXiv
publishDate 2024
record_format arxiv
title Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2412.20964