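Staff View: WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation :: Library Catalog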

Bibliographic Details
Main Authors:	Fang, Zihao; Shen, Yingda; Guan, Zifan; Song, Tongtong; Liu, Zhenyi; Wu, Zhizheng
Format:	Preprint
Published:	2026
Subjects:	Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2603.08046

_version_	1866914379753062400
author	Fang, Zihao; Shen, Yingda; Guan, Zifan; Song, Tongtong; Liu, Zhenyi; Wu, Zhizheng
author_facet	Fang, Zihao; Shen, Yingda; Guan, Zifan; Song, Tongtong; Liu, Zhenyi; Wu, Zhizheng
contents	Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_08046
institution	arXiv
publishDate	2026
record_format	arxiv
title	WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation
topic	Sound; Audio and Speech Processing
url	https://arxiv.org/abs/2603.08046

Similar Items