| Main Authors: | Fang, Zihao; Shen, Yingda; Guan, Zifan; Song, Tongtong; Liu, Zhenyi; Wu, Zhizheng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Sound; Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2603.08046 |
| _version_ | 1866914379753062400 |
|---|---|
| author | Fang, Zihao; Shen, Yingda; Guan, Zifan; Song, Tongtong; Liu, Zhenyi; Wu, Zhizheng |
| author_facet | Fang, Zihao; Shen, Yingda; Guan, Zifan; Song, Tongtong; Liu, Zhenyi; Wu, Zhizheng |
| contents | Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_08046 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation |
| topic | Sound; Audio and Speech Processing |
| url | https://arxiv.org/abs/2603.08046 |