Bibliographic Details
Main Authors: Kuşçu, Gökhan, Erzin, Engin
Format: Preprint
Published: 2024
Subjects: Audio and Speech Processing; Human-Computer Interaction
Online Access: https://arxiv.org/abs/2406.02569
author Kuşçu, Gökhan
Erzin, Engin
author_facet Kuşçu, Gökhan
Erzin, Engin
contents Continuous emotion recognition (CER) aims to track the dynamic changes in a person's emotional state over time. This paper proposes a novel approach to translating CER into a prediction problem of dynamic affect-contour clusters from speech, where the affect-contour is defined as the contour of annotated affect attributes in a temporal window. Our approach defines a cluster-to-predict (C2P) framework that learns affect-contour clusters, which are predicted from speech with higher precision. To achieve this, C2P runs an unsupervised iterative optimization process to learn affect-contour clusters by minimizing both clustering loss and speech-driven affect-contour prediction loss. Our objective findings demonstrate the value of speech-driven clustering for both arousal and valence attributes. Experiments conducted on the RECOLA dataset yielded promising classification results, with F1 scores of 0.84 for arousal and 0.75 for valence in our four-class speech-driven affect-contour prediction model.
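The abstract describes an iterative process that learns affect-contour clusters by minimizing a clustering loss together with a speech-driven prediction loss. The record does not give the model or the loss definitions, so the following is only a minimal sketch of that general idea under stated assumptions: k-means-style centroids over contour windows, a small softmax classifier standing in for the speech predictor, and a hypothetical weight `lam` balancing the two terms. The names `cluster_to_predict_sketch`, `fit_softmax`, `contours`, and `speech` are illustrative and do not come from the paper.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(X, labels, k, steps=100, lr=0.5):
    # Hypothetical stand-in predictor: multinomial logistic regression
    # trained by plain gradient descent on cross-entropy.
    n, d = X.shape
    W = np.zeros((d, k))
    b = np.zeros(k)
    Y = np.eye(k)[labels]          # one-hot cluster targets
    for _ in range(steps):
        G = (softmax(X @ W + b) - Y) / n
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def cluster_to_predict_sketch(contours, speech, k=4, lam=1.0, iters=10, seed=0):
    # contours: (n, T) affect-contour windows; speech: (n, d) speech features.
    rng = np.random.default_rng(seed)
    centroids = contours[rng.choice(len(contours), size=k, replace=False)]
    dist = ((contours[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
    labels = dist.argmin(axis=1)   # initial assignment: nearest centroid
    for _ in range(iters):
        # 1) Fit the predictor to the current cluster labels.
        W, b = fit_softmax(speech, labels, k)
        P = softmax(speech @ W + b)
        # 2) Reassign each window using both terms (assumed combination):
        #    squared distance to centroid plus lam * cross-entropy.
        dist = ((contours[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
        labels = (dist - lam * np.log(P + 1e-12)).argmin(axis=1)
        # 3) Update centroids; empty clusters keep their previous centroid.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = contours[labels == j].mean(axis=0)
    return labels, centroids, (W, b)
```

In this sketch, `contours` would hold fixed-length windows of an annotated attribute such as arousal or valence, and `speech` would hold one feature vector per window. Setting `lam=0` reduces the reassignment step to ordinary k-means, which makes the effect of the speech-driven term easy to compare.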
format Preprint
id arxiv_https___arxiv_org_abs_2406_02569
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Cluster-to-Predict Affect Contours from Speech
Kuşçu, Gökhan
Erzin, Engin
Audio and Speech Processing
Human-Computer Interaction
title Cluster-to-Predict Affect Contours from Speech
topic Audio and Speech Processing
Human-Computer Interaction
url https://arxiv.org/abs/2406.02569