Bibliographic Details
Main Authors: Jing, Xin, Triantafyllopoulos, Andreas, Wang, Jiadong, Amiriparian, Shahin, Luo, Jun, Schuller, Björn
Format: Preprint
Published: 2026
Subjects: Sound
Online Access: https://arxiv.org/abs/2603.09820
author Jing, Xin
Triantafyllopoulos, Andreas
Wang, Jiadong
Amiriparian, Shahin
Luo, Jun
Schuller, Björn
contents Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context-collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements regarding vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment for long-form captions compared to traditional metrics, which demonstrated negative correlations due to their sensitivity to caption length.
format Preprint
id arxiv_https___arxiv_org_abs_2603_09820
institution arXiv
publishDate 2026
record_format arxiv
title EmoSURA: Towards Accurate Evaluation of Detailed and Long-Context Emotional Speech Captions
topic Sound
url https://arxiv.org/abs/2603.09820