Staff View: Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better :: Library Catalog

Bibliographic Details
Main Authors:	Ge, Mengying; Li, Mingyang; Tang, Dongkai; Li, Pengbo; Liu, Kuo; Deng, Shuhao; Pu, Songbai; Liu, Long; Song, Yang; Zhang, Tao
Format:	Preprint
Published:	2024
Subjects:	Multimedia; Artificial Intelligence; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2409.18971

_version_	1866912049545609216
author	Ge, Mengying; Li, Mingyang; Tang, Dongkai; Li, Pengbo; Liu, Kuo; Deng, Shuhao; Pu, Songbai; Liu, Long; Song, Yang; Zhang, Tao
author_facet	Ge, Mengying; Li, Mingyang; Tang, Dongkai; Li, Pengbo; Liu, Kuo; Deng, Shuhao; Pu, Songbai; Liu, Long; Song, Yang; Zhang, Tao
contents	In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate modal competition between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly. The resulting joint audio-text feature is then late-fused with other unimodal features. To address data insufficiency and class imbalance, we use multiple rounds of multi-model voting for data mining. Moreover, to enhance the quality of audio features, we apply speech source separation to preprocess the audio. Our model ranks 2nd in both MER2024-SEMI and MER2024-NOISE, validating our method's effectiveness.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_18971
institution	arXiv
publishDate	2024
record_format	arxiv
title	Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better
topic	Multimedia; Artificial Intelligence; Sound; Audio and Speech Processing
url	https://arxiv.org/abs/2409.18971

Similar Items