| Main Authors: | Jing, Xin; Triantafyllopoulos, Andreas; Wang, Jiadong; Amiriparian, Shahin; Luo, Jun; Schuller, Björn |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Sound |
| Online Access: | https://arxiv.org/abs/2603.09820 |
| _version_ | 1866908877108281344 |
|---|---|
| author | Jing, Xin; Triantafyllopoulos, Andreas; Wang, Jiadong; Amiriparian, Shahin; Luo, Jun; Schuller, Björn |
| author_facet | Jing, Xin; Triantafyllopoulos, Andreas; Wang, Jiadong; Amiriparian, Shahin; Luo, Jun; Schuller, Björn |
| contents | Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context-collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements regarding vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment for long-form captions compared to traditional metrics, which demonstrated negative correlations due to their sensitivity to caption length. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_09820 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | EmoSURA: Towards Accurate Evaluation of Detailed and Long-Context Emotional Speech Captions |
| topic | Sound |
| url | https://arxiv.org/abs/2603.09820 |