Bibliographic Details
Main Authors: Ge, Mengying, Li, Mingyang, Tang, Dongkai, Li, Pengbo, Liu, Kuo, Deng, Shuhao, Pu, Songbai, Liu, Long, Song, Yang, Zhang, Tao
Format: Preprint
Published: 2024
Subjects: Multimedia, Artificial Intelligence, Sound, Audio and Speech Processing
Online Access: https://arxiv.org/abs/2409.18971
author Ge, Mengying
Li, Mingyang
Tang, Dongkai
Li, Pengbo
Liu, Kuo
Deng, Shuhao
Pu, Songbai
Liu, Long
Song, Yang
Zhang, Tao
contents In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate modal competition between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly. The joint audio-text feature is then late-fused with other unimodal features. To address data insufficiency and class imbalance, we use multiple rounds of multi-model voting for data mining. Moreover, to enhance the quality of audio features, we employ speech source separation to preprocess the audio. Our model ranks 2nd in both MER2024-SEMI and MER2024-NOISE, validating our method's effectiveness.
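The multi-model voting step mentioned in the abstract can be illustrated with a minimal sketch. The record does not say how the authors implement voting, so everything here is an assumption for illustration: the function name `vote_pseudo_labels`, the `min_agreement` threshold, the clip identifiers, and the label set are hypothetical. The idea shown is a plain majority vote that keeps an unlabeled sample only when enough models agree on its label.

```python
from collections import Counter


def vote_pseudo_labels(predictions, min_agreement=0.8):
    """Select pseudo-labels by majority vote across several models.

    predictions: dict mapping sample id -> list of predicted labels,
                 one label per model (hypothetical input layout).
    min_agreement: assumed threshold; the fraction of models that must
                   agree before a sample is kept.
    Returns a dict mapping sample id -> agreed label.
    """
    selected = {}
    for sample_id, labels in predictions.items():
        if not labels:
            continue  # no model produced a prediction for this sample
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            selected[sample_id] = label
    return selected


# Hypothetical predictions from five models on three unlabeled clips.
preds = {
    "clip_001": ["happy", "happy", "happy", "happy", "neutral"],
    "clip_002": ["sad", "angry", "sad", "neutral", "worried"],
    "clip_003": ["angry", "angry", "angry", "angry", "angry"],
}
mined = vote_pseudo_labels(preds, min_agreement=0.8)
```

In a multi-round setup, the selected samples would presumably be added to the training set, the models retrained, and the vote repeated on the remaining unlabeled data; that loop is not shown because the record gives no details about it.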
format Preprint
id arxiv_https___arxiv_org_abs_2409_18971
institution arXiv
publishDate 2024
record_format arxiv
title Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better
topic Multimedia
Artificial Intelligence
Sound
Audio and Speech Processing
url https://arxiv.org/abs/2409.18971