Bibliographic Details
Main Authors: Fang, Zihao; Shen, Yingda; Guan, Zifan; Song, Tongtong; Liu, Zhenyi; Wu, Zhizheng
Format: Preprint
Published: 2026
Subjects: Sound; Audio and Speech Processing
Online Access: https://arxiv.org/abs/2603.08046
author Fang, Zihao
Shen, Yingda
Guan, Zifan
Song, Tongtong
Liu, Zhenyi
Wu, Zhizheng
contents Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.
format Preprint
id arxiv_https___arxiv_org_abs_2603_08046
institution arXiv
publishDate 2026
record_format arxiv
title WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation
topic Sound
Audio and Speech Processing
url https://arxiv.org/abs/2603.08046