Staff View: FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter :: Library Catalog

Bibliographic Details
Main Authors:	Lv, Yuanjun; Li, Hai; Yan, Ying; Liu, Junhui; Xie, Danming; Xie, Lei
Format:	Preprint
Published:	2024
Subjects:	Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2406.08196

_version_	1866913387953258496
author	Lv, Yuanjun; Li, Hai; Yan, Ying; Liu, Junhui; Xie, Danming; Xie, Lei
author_facet	Lv, Yuanjun; Li, Hai; Yan, Ying; Liu, Junhui; Xie, Danming; Xie, Lei
contents	Vocoders reconstruct speech waveforms from acoustic features and play a pivotal role in modern TTS systems. Frequency-domain GAN vocoders such as Vocos and APNet2 have recently advanced rapidly, outperforming time-domain models in inference speed while achieving comparable audio quality. However, these frequency-domain vocoders have large parameter counts, which adds memory burden. Inspired by PriorGrad and SpecGrad, we use the pseudo-inverse of the mel filter to obtain a rough estimate of the amplitude spectrum as the initialization. This simple initialization significantly reduces the number of parameters the vocoder requires. Building on APNet2 and our streamlined amplitude prediction branch, we propose FreeV. Compared with its counterpart APNet2, FreeV achieves a 1.8 times inference speed improvement with nearly half the parameters. Meanwhile, FreeV outperforms APNet2 in resynthesis quality, marking a step forward in pursuing real-time, high-fidelity speech synthesis. Code and checkpoints are available at: https://github.com/BakerBunker/FreeV
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_08196
institution	arXiv
publishDate	2024
record_format	arxiv
title	FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter
topic	Sound; Audio and Speech Processing
url	https://arxiv.org/abs/2406.08196
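
The abstract describes estimating the amplitude spectrum from mel features with a pseudo-inverse. The sketch below illustrates that general idea only and is not taken from the FreeV repository: the triangular filterbank construction, the helper names (`mel_filterbank`, `estimate_amplitude`), the `floor` value, and the sizes (80 mel bands, 1024-point FFT, 22050 Hz) are all assumptions chosen for illustration.

```python
import numpy as np


def hz_to_mel(f):
    # Common HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Build triangular mel filters with shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_mels + 2 band edges, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def estimate_amplitude(mel, fb, floor=1e-5):
    """Rough amplitude estimate: pseudo-inverse of the mel filter applied to mel."""
    inv = np.linalg.pinv(fb)  # shape (n_bins, n_mels)
    # The least-squares solution can go negative, so clamp to a small floor.
    return np.clip(inv @ mel, floor, None)


# Hypothetical data: a random non-negative "amplitude spectrogram" with 10 frames.
rng = np.random.default_rng(0)
fb = mel_filterbank(n_mels=80, n_fft=1024, sample_rate=22050)
amp = rng.random((513, 10))
mel = fb @ amp                      # linear-scale mel spectrogram
amp_hat = estimate_amplitude(mel, fb)
```

Because 80 mel bands cannot determine 513 frequency bins, `amp_hat` is only a coarse approximation of `amp`; in the setting the abstract describes, such an estimate serves as an initialization that the vocoder then refines.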

Similar Items