Staff View: Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture :: Library Catalog

Bibliographic Details
Main Authors:	Mandal, Nischal; Li, Yang
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.04642

_version_	1866909604368089088
author	Mandal, Nischal; Li, Yang
author_facet	Mandal, Nischal; Li, Yang
contents	Multimodal sentiment analysis, a pivotal task in affective computing, seeks to understand human emotions by integrating cues from language, audio, and visual signals. While many recent approaches leverage complex attention mechanisms and hierarchical architectures, we propose a lightweight yet effective fusion-based deep learning model tailored for utterance-level emotion classification. Using the benchmark IEMOCAP dataset, which includes aligned text, audio-derived numeric features, and visual descriptors, we design an encoder for each modality using fully connected layers followed by dropout regularization. The modality-specific representations are then fused using simple concatenation and passed through a dense fusion layer to capture cross-modal interactions. This streamlined architecture avoids computational overhead while preserving performance, achieving a classification accuracy of 92% across six emotion categories. Our approach demonstrates that with careful feature engineering and modular design, simpler fusion strategies can match or outperform more complex models, particularly in resource-constrained environments.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_04642
institution	arXiv
publishDate	2025
record_format	arxiv
title	Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2505.04642
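
The architecture summarized in the contents field (one fully connected encoder with dropout per modality, concatenation, then a dense fusion layer) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the record gives no layer sizes, activations, or dropout rate, so every dimension, the ReLU and softmax choices, the 0.3 dropout rate, and all function names below are hypothetical. Only the six output classes and the overall layout come from the abstract.

```python
import math
import random

# Hypothetical sizes; the abstract does not state feature or layer dimensions.
TEXT_DIM, AUDIO_DIM, VISUAL_DIM = 8, 6, 4
HIDDEN_DIM, FUSION_DIM = 5, 7
NUM_CLASSES = 6  # six emotion categories, as stated in the abstract


def init_dense(rng, in_dim, out_dim):
    """Return (weights, biases) for a fully connected layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def dense(x, layer):
    """Apply a fully connected layer: y = W x + b."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    # Assumed activation; the abstract does not name one.
    return [max(0.0, v) for v in x]


def dropout(x, rate, rng, training):
    """Inverted dropout: zero units with probability `rate` in training, identity otherwise."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def build_model(seed=0):
    """Create one encoder per modality, a fusion layer, and a classifier head."""
    rng = random.Random(seed)
    return {
        "text": init_dense(rng, TEXT_DIM, HIDDEN_DIM),
        "audio": init_dense(rng, AUDIO_DIM, HIDDEN_DIM),
        "visual": init_dense(rng, VISUAL_DIM, HIDDEN_DIM),
        "fusion": init_dense(rng, 3 * HIDDEN_DIM, FUSION_DIM),
        "classifier": init_dense(rng, FUSION_DIM, NUM_CLASSES),
    }


def forward(model, text, audio, visual, rng=None, training=False, rate=0.3):
    """Encode each modality, concatenate, fuse, and return class probabilities."""
    rng = rng or random.Random(0)
    encoded = []
    for name, features in (("text", text), ("audio", audio), ("visual", visual)):
        hidden = relu(dense(features, model[name]))
        encoded.append(dropout(hidden, rate, rng, training))
    fused_input = encoded[0] + encoded[1] + encoded[2]  # simple concatenation
    fused = relu(dense(fused_input, model["fusion"]))
    return softmax(dense(fused, model["classifier"]))
```

Calling `forward(build_model(), text, audio, visual)` with feature lists of the assumed lengths returns six probabilities that sum to one. The weights here are random and untrained, so the output says nothing about the 92% accuracy reported in the abstract.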

Similar Items