Staff View: VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Yancheng; Hanna, Osama; Xie, Ruiming; Rui, Xianfeng; Shen, Maohao; Zhang, Xuedong; Fuegen, Christian; Wu, Jilong; Paul, Debjyoti; Guo, Arthur; Lei, Zhihong; Kalinli, Ozlem; He, Qing; Yang, Yingzhen
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2602.06270

_version_	1866908816773218304
author	Wang, Yancheng; Hanna, Osama; Xie, Ruiming; Rui, Xianfeng; Shen, Maohao; Zhang, Xuedong; Fuegen, Christian; Wu, Jilong; Paul, Debjyoti; Guo, Arthur; Lei, Zhihong; Kalinli, Ozlem; He, Qing; Yang, Yingzhen
author_facet	Wang, Yancheng; Hanna, Osama; Xie, Ruiming; Rui, Xianfeng; Shen, Maohao; Zhang, Xuedong; Fuegen, Christian; Wu, Jilong; Paul, Debjyoti; Guo, Arthur; Lei, Zhihong; Kalinli, Ozlem; He, Qing; Yang, Yingzhen
contents	Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_06270
institution	arXiv
publishDate	2026
record_format	arxiv
title	VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation
topic	Computation and Language
url	https://arxiv.org/abs/2602.06270
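
The contents field above describes extracting pitch-, energy-, and duration-based descriptors from time-aligned vowel segments and converting them into natural language descriptions. The following is a minimal sketch of that idea only, not the authors' implementation: the `Segment` type, the ARPAbet vowel set, the z-score threshold of 0.5, and the wording of the output sentences are all assumptions made for illustration, and the per-segment pitch and energy values are assumed to come from an external aligner and feature extractor.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical vowel inventory (ARPAbet, stress digits stripped); an assumption,
# not taken from the record.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}


@dataclass
class Segment:
    """One time-aligned phone with precomputed acoustic features (assumed inputs)."""
    phone: str        # ARPAbet label, e.g. "AY1"
    start: float      # seconds
    end: float        # seconds
    pitch_hz: float   # mean fundamental frequency over the segment
    energy_db: float  # mean intensity over the segment


def is_vowel(phone: str) -> bool:
    # Strip trailing stress digits before the lookup.
    return phone.rstrip("012") in VOWELS


def level(value, values, labels=("low", "average", "high"), threshold=0.5):
    """Bucket a value by its z-score relative to the other vowels in the utterance."""
    sigma = pstdev(values)
    if sigma == 0:
        return labels[1]
    z = (value - mean(values)) / sigma
    if z > threshold:
        return labels[2]
    if z < -threshold:
        return labels[0]
    return labels[1]


def describe_vowels(segments):
    """Return one short natural-language description per vowel segment."""
    vowels = [s for s in segments if is_vowel(s.phone)]
    if not vowels:
        return []
    pitches = [s.pitch_hz for s in vowels]
    energies = [s.energy_db for s in vowels]
    durations = [s.end - s.start for s in vowels]
    lines = []
    for s, dur in zip(vowels, durations):
        p = level(s.pitch_hz, pitches)
        e = level(s.energy_db, energies)
        d = level(dur, durations, labels=("short", "average", "long"))
        lines.append(f"vowel {s.phone}: {p} pitch, {e} energy, {d} duration")
    return lines


example = [
    Segment("AY1", 0.10, 0.30, 260.0, 72.0),
    Segment("M", 0.30, 0.36, 0.0, 0.0),      # consonant, ignored
    Segment("AH0", 0.36, 0.42, 180.0, 60.0),
    Segment("IY1", 0.50, 0.60, 200.0, 66.0),
]
descriptions = describe_vowels(example)
```

In this sketch the resulting strings would be appended to the transcript in the LLM prompt; how the prompt is actually assembled, and how the descriptors are normalized, is not specified in this record.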

Similar Items