Bibliographic Details
Main Authors: Shi, Pujin, Gao, Fei
Format: Preprint
Published: 2024
Subjects: Sound; Artificial Intelligence; Audio and Speech Processing
Online Access: https://arxiv.org/abs/2409.05007
author Shi, Pujin
Gao, Fei
author_facet Shi, Pujin
Gao, Fei
contents In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) of MER2024. First, to improve the feature extractors on sentiment classification tasks, we fine-tuned the video and text feature extractors, CLIP-vit-large and Baichuan-13B, on labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to enhance the accuracy of the model, we iteratively apply self-supervised learning, using high-confidence predictions on unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets. Therefore, we adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.
format Preprint
id arxiv_https___arxiv_org_abs_2409_05007
institution arXiv
publishDate 2024
record_format arxiv
title Audio-Guided Fusion Techniques for Multimodal Emotion Analysis
topic Sound
Artificial Intelligence
Audio and Speech Processing
url https://arxiv.org/abs/2409.05007