Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Jin, Peng, Li, Hao, Yuan, Li, Yan, Shuicheng, Chen, Jie
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2412.20964
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866912173175865344
author	Jin, Peng Li, Hao Yuan, Li Yan, Shuicheng Chen, Jie
author_facet	Jin, Peng Li, Hao Yuan, Li Yan, Shuicheng Chen, Jie
contents	Multimodal representation learning, with contrastive learning, plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video-text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to mitigate the bias in calculations within Banzhaf Interaction, we propose reconstructing the representation through a fusion of single-modal and cross-modal components. This reconstructed representation ensures fine granularity comparable to that of the single-modal representation, while also preserving the adaptive encoding characteristics of cross-modal representation. Additionally, we extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used text-video retrieval, video-question answering, and video captioning benchmarks, with superior performance, validate the effectiveness and generalization of our method.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_20964
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Hierarchical Banzhaf Interaction for General Video-Language Representation Learning Jin, Peng Li, Hao Yuan, Li Yan, Shuicheng Chen, Jie Computer Vision and Pattern Recognition Multimodal representation learning, with contrastive learning, plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video-text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to mitigate the bias in calculations within Banzhaf Interaction, we propose reconstructing the representation through a fusion of single-modal and cross-modal components. This reconstructed representation ensures fine granularity comparable to that of the single-modal representation, while also preserving the adaptive encoding characteristics of cross-modal representation. Additionally, we extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used text-video retrieval, video-question answering, and video captioning benchmarks, with superior performance, validate the effectiveness and generalization of our method.
title	Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2412.20964

Similar Items