Staff View: Audio-Guided Fusion Techniques for Multimodal Emotion Analysis :: Library Catalog

Bibliographic Details
Main Authors:	Shi, Pujin, Gao, Fei
Format:	Preprint
Published:	2024
Subjects:	Sound Artificial Intelligence Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2409.05007

_version_	1866914943376293888
author	Shi, Pujin Gao, Fei
author_facet	Shi, Pujin Gao, Fei
contents	In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024. First, to enhance the performance of the feature extractors on sentiment classification tasks, we fine-tuned the video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to enhance the accuracy of the model, we iteratively apply self-supervised learning by using high-confidence unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets. Therefore, we adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_05007
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Audio-Guided Fusion Techniques for Multimodal Emotion Analysis Shi, Pujin Gao, Fei Sound Artificial Intelligence Audio and Speech Processing In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024. First, to enhance the performance of the feature extractors on sentiment classification tasks, we fine-tuned the video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to enhance the accuracy of the model, we iteratively apply self-supervised learning by using high-confidence unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets. Therefore, we adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.
title	Audio-Guided Fusion Techniques for Multimodal Emotion Analysis
topic	Sound Artificial Intelligence Audio and Speech Processing
url	https://arxiv.org/abs/2409.05007

Similar Items