Bibliographic Details
Main Authors: Wu, Minghui; Zhao, Chenxu; Su, Anyang; Di, Donglin; Fu, Tianyu; An, Da; He, Min; Gao, Ya; Ma, Meng; Yan, Kun; Wang, Ping
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2407.08150
author Wu, Minghui
Zhao, Chenxu
Su, Anyang
Di, Donglin
Fu, Tianyu
An, Da
He, Min
Gao, Ya
Ma, Meng
Yan, Kun
Wang, Ping
contents How viewers understand video creativity and content often varies among individuals, with focal points and cognitive levels differing across ages, experiences, and genders. Research in this area is currently lacking, and most existing benchmarks suffer from several drawbacks: 1) a limited number of modalities and answers of restricted length; 2) video content and scenarios that are excessively monotonous, conveying overly simplistic allegories and emotions. To bridge the gap to real-world applications, we introduce a large-scale dataset of Subjective Response Indicators for Advertisement Videos, named SRI-ADV. Specifically, we collected real changes in electroencephalographic (EEG) signals and eye-tracking regions from different demographics while they viewed identical video content. Using this multi-modal dataset, we developed tasks and protocols to analyze and evaluate how well different users cognitively understand video content. Along with the dataset, we designed a Hypergraph Multi-modal Large Language Model (HMLLM) to explore the associations among different demographics, video elements, EEG, and eye-tracking indicators. HMLLM can bridge semantic gaps across rich modalities and integrate information across modalities to perform logical reasoning. Extensive experimental evaluations on SRI-ADV and additional video-based generative performance benchmarks demonstrate the effectiveness of our method. The code and dataset will be released at https://github.com/mininglamp-MLLM/HMLLM.
format Preprint
id arxiv_https___arxiv_org_abs_2407_08150
institution arXiv
publishDate 2024
record_format arxiv
title Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2407.08150