Bibliographic Details
Main Authors: Li, Jiagen, Yu, Rui, Huang, Huihao, Yan, Huaicheng
Format: Preprint
Published: 2025
Subjects: Machine Learning; Artificial Intelligence
Online Access: https://arxiv.org/abs/2503.23721
_version_ 1866909559332798464
author Li, Jiagen
Yu, Rui
Huang, Huihao
Yan, Huaicheng
contents Multimodal Emotion Recognition in Conversations (MERC) identifies emotional states across text, audio and video, which is essential for intelligent dialogue systems and opinion analysis. Existing methods emphasize heterogeneous modal fusion directly for cross-modal integration, but often suffer from disorientation in multimodal learning due to modal heterogeneity and lack of instructive guidance. In this work, we propose SUMMER, a novel heterogeneous multimodal integration framework leveraging Mixture of Experts with Hierarchical Cross-modal Fusion and Interactive Knowledge Distillation. Key components include a Sparse Dynamic Mixture of Experts (SDMoE) for capturing dynamic token-wise interactions, a Hierarchical Cross-Modal Fusion (HCMF) for effective fusion of heterogeneous modalities, and Interactive Knowledge Distillation (IKD), which uses a pre-trained unimodal teacher to guide multimodal fusion in latent and logit spaces. Experiments on IEMOCAP and MELD show SUMMER outperforms state-of-the-art methods, particularly in recognizing minority and semantically similar emotions.
format Preprint
id arxiv_https___arxiv_org_abs_2503_23721
institution arXiv
publishDate 2025
record_format arxiv
title Unimodal-driven Distillation in Multimodal Emotion Recognition with Dynamic Fusion
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2503.23721