Bibliographic Details
Main Authors: Pang, Zi Haur, Fu, Yahui, Gao, Yuan, Kawahara, Tatsuya
Format: Preprint
Published: 2026
Subjects: Sound
Online Access: https://arxiv.org/abs/2603.09307
_version_ 1866912958477500416
author Pang, Zi Haur
Fu, Yahui
Gao, Yuan
Kawahara, Tatsuya
contents Emotional validation is a psychotherapy communication technique that involves recognizing, understanding, and explicitly acknowledging another person's feelings and actions, thereby strengthening the alliance and reducing negative affect. To maximize the emotional support that validation provides, it must be delivered with appropriate timing and frequency. This study investigates validation timing detection from speech. Leveraging both paralinguistic and emotional information, we propose a paralinguistic- and emotion-aware model for validation timing detection that does not rely on textual context. Specifically, we first conduct continued self-supervised training and fine-tuning on different HuBERT backbones to obtain (i) a paralinguistics-aware self-supervised learning (SSL) encoder and (ii) a multi-task speech emotion classification encoder. We then fuse these encoders and further fine-tune the combined model on the downstream validation timing detection task. Experimental evaluations on the TUT Emotional Storytelling Corpus (TESC) compare multiple models, fusion mechanisms, and training strategies, and demonstrate that the proposed approach achieves significant improvements over conventional speech baselines. Our results indicate that non-linguistic speech cues, when integrated with affect-related representations, carry sufficient signal to decide when validation should be expressed, offering a speech-first pathway toward more empathetic human-robot interaction.
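The two-encoder design described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say which fusion mechanism performed best, so plain concatenation stands in for it, the frame-level features are placeholder lists rather than real HuBERT outputs, and the names `mean_pool`, `concat_fusion`, `ValidationTimingHead`, and `detect_validation_timing` are hypothetical. It is not the authors' implementation.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Average frame-level features over time into one utterance-level vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def concat_fusion(paralinguistic: Vector, emotion: Vector) -> Vector:
    """Fuse two encoder outputs by concatenation (an assumed, simplest-case fusion)."""
    return list(paralinguistic) + list(emotion)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class ValidationTimingHead:
    """Hypothetical binary head: probability that validation fits at this point."""

    def __init__(self, input_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Small random weights stand in for parameters learned by fine-tuning.
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
        self.bias = 0.0

    def predict_proba(self, fused: Vector) -> float:
        if len(fused) != len(self.weights):
            raise ValueError("fused vector has the wrong dimension")
        logit = sum(w * x for w, x in zip(self.weights, fused)) + self.bias
        return sigmoid(logit)


def detect_validation_timing(
    para_frames: Sequence[Sequence[float]],
    emo_frames: Sequence[Sequence[float]],
    head: ValidationTimingHead,
    threshold: float = 0.5,
) -> bool:
    """Pool each encoder's frames, fuse them, and threshold the head's probability."""
    fused = concat_fusion(mean_pool(para_frames), mean_pool(emo_frames))
    return head.predict_proba(fused) >= threshold


if __name__ == "__main__":
    rng = random.Random(1)
    # Placeholder features: 50 frames of 4-dim vectors per (hypothetical) encoder.
    para = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
    emo = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
    head = ValidationTimingHead(input_dim=8)
    print(detect_validation_timing(para, emo, head))
```

In a real system the placeholder lists would be replaced by hidden states from the two fine-tuned speech encoders, and the head's weights would be trained on labeled validation-timing data rather than drawn at random.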
format Preprint
id arxiv_https___arxiv_org_abs_2603_09307
institution arXiv
publishDate 2026
record_format arxiv
title Paralinguistic Emotion-Aware Validation Timing Detection in Japanese Empathetic Spoken Dialogue
topic Sound
url https://arxiv.org/abs/2603.09307