Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Minghui; Zhao, Chenxu; Su, Anyang; Di, Donglin; Fu, Tianyu; An, Da; He, Min; Gao, Ya; Ma, Meng; Yan, Kun; Wang, Ping
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2407.08150

_version_	1866917768950972416
author	Wu, Minghui; Zhao, Chenxu; Su, Anyang; Di, Donglin; Fu, Tianyu; An, Da; He, Min; Gao, Ya; Ma, Meng; Yan, Kun; Wang, Ping
author_facet	Wu, Minghui; Zhao, Chenxu; Su, Anyang; Di, Donglin; Fu, Tianyu; An, Da; He, Min; Gao, Ya; Ma, Meng; Yan, Kun; Wang, Ping
contents	Understanding of video creativity and content often varies among individuals, with differences in focal points and cognitive levels across different ages, experiences, and genders. There is currently a lack of research in this area, and most existing benchmarks suffer from several drawbacks: 1) a limited number of modalities and answers with restrictive length; 2) the content and scenarios within the videos are excessively monotonous, transmitting allegories and emotions that are overly simplistic. To bridge the gap to real-world applications, we introduce a large-scale Subjective Response Indicators for Advertisement Videos dataset, namely SRI-ADV. Specifically, we collected real changes in Electroencephalographic (EEG) and eye-tracking regions from different demographics while they viewed identical video content. Utilizing this multi-modal dataset, we developed tasks and protocols to analyze and evaluate the extent of cognitive understanding of video content among different users. Along with the dataset, we designed a Hypergraph Multi-modal Large Language Model (HMLLM) to explore the associations among different demographics, video elements, EEG, and eye-tracking indicators. HMLLM could bridge semantic gaps across rich modalities and integrate information beyond different modalities to perform logical reasoning. Extensive experimental evaluations on SRI-ADV and other additional video-based generative performance benchmarks demonstrate the effectiveness of our method. The codes and dataset will be released at https://github.com/mininglamp-MLLM/HMLLM.
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_08150
institution	arXiv
publishDate	2024
record_format	arxiv
title	Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2407.08150

Similar Items