| Main Authors: | Kuşçu, Gökhan; Erzin, Engin |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Audio and Speech Processing; Human-Computer Interaction |
| Online Access: | https://arxiv.org/abs/2406.02569 |
| _version_ | 1866911904686931968 |
|---|---|
| author | Kuşçu, Gökhan; Erzin, Engin |
| author_facet | Kuşçu, Gökhan; Erzin, Engin |
| contents | Continuous emotion recognition (CER) aims to track the dynamic changes in a person's emotional state over time. This paper proposes a novel approach to translating CER into a prediction problem of dynamic affect-contour clusters from speech, where the affect-contour is defined as the contour of annotated affect attributes in a temporal window. Our approach defines a cluster-to-predict (C2P) framework that learns affect-contour clusters, which are predicted from speech with higher precision. To achieve this, C2P runs an unsupervised iterative optimization process to learn affect-contour clusters by minimizing both clustering loss and speech-driven affect-contour prediction loss. Our objective findings demonstrate the value of speech-driven clustering for both arousal and valence attributes. Experiments conducted on the RECOLA dataset yielded promising classification results, with F1 scores of 0.84 for arousal and 0.75 for valence in our four-class speech-driven affect-contour prediction model. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2406_02569 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Cluster-to-Predict Affect Contours from Speech |
| topic | Audio and Speech Processing; Human-Computer Interaction |
| url | https://arxiv.org/abs/2406.02569 |
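
The abstract describes C2P as an iterative process that learns affect-contour clusters by minimizing a clustering loss together with a speech-driven prediction loss. The record gives no implementation details, so the following is only a minimal sketch of that general idea, not the authors' method. It assumes a k-means-style contour distortion, a ridge least-squares predictor on one-hot cluster labels, and a hypothetical weighting parameter `lam`; the function name `c2p_sketch` and all argument names are likewise illustrative.

```python
import numpy as np

def c2p_sketch(X, Y, k=4, lam=1.0, iters=20, seed=0):
    """Alternate between clustering contours Y and predicting clusters from X.

    X: (n, d) speech features; Y: (n, t) affect contours (both hypothetical).
    Returns (labels, centroids, W), where W maps [X, 1] to k cluster scores.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xb = np.hstack([X, np.ones((n, 1))])  # append a bias column
    W = np.zeros((Xb.shape[1], k))

    # Initialise centroids from k random contours, then assign by distance.
    centroids = Y[rng.choice(n, size=k, replace=False)].copy()
    labels = ((Y[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)

    for _ in range(iters):
        # 1) Clustering step: update centroids (an empty cluster keeps its old one).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = Y[labels == j].mean(0)

        # 2) Prediction step: ridge least squares from features to one-hot labels.
        onehot = np.eye(k)[labels]
        A = Xb.T @ Xb + 1e-3 * np.eye(Xb.shape[1])
        W = np.linalg.solve(A, Xb.T @ onehot)
        scores = Xb @ W

        # 3) Reassignment: contour distortion plus lam times prediction error.
        clust_cost = ((Y[:, None, :] - centroids[None]) ** 2).sum(-1)
        pred_cost = ((scores[:, None, :] - np.eye(k)[None]) ** 2).sum(-1)
        new_labels = (clust_cost + lam * pred_cost).argmin(1)

        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so W matches the final labels
        labels = new_labels

    return labels, centroids, W
```

With `lam=0` the reassignment step reduces to plain nearest-centroid clustering of the contours; larger values pull the clusters toward groupings that the speech features can separate, which is the trade-off the abstract alludes to. If the loop stops at `iters` without converging, `W` was fitted to the labels from before the last reassignment.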