Bibliographic Details
Main Authors: Lv, Yuanjun; Li, Hai; Yan, Ying; Liu, Junhui; Xie, Danming; Xie, Lei
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2406.08196
author Lv, Yuanjun
Li, Hai
Yan, Ying
Liu, Junhui
Xie, Danming
Xie, Lei
contents Vocoders reconstruct speech waveforms from acoustic features and play a pivotal role in modern TTS systems. Frequency-domain GAN vocoders such as Vocos and APNet2 have recently advanced rapidly, outperforming time-domain models in inference speed while achieving comparable audio quality. However, these frequency-domain vocoders have large parameter counts, which adds memory burden. Inspired by PriorGrad and SpecGrad, we use the pseudo-inverse of the mel filter to obtain a rough estimate of the amplitude spectrum as an initialization. This simple initialization significantly reduces the number of parameters the vocoder needs. Building on APNet2 and our streamlined amplitude prediction branch, we propose FreeV. Compared with its counterpart APNet2, FreeV achieves 1.8 times faster inference with nearly half the parameters. Meanwhile, FreeV outperforms APNet2 in resynthesis quality, marking a step forward in pursuing real-time, high-fidelity speech synthesis. Code and checkpoints are available at: https://github.com/BakerBunker/FreeV
format Preprint
id arxiv_https___arxiv_org_abs_2406_08196
institution arXiv
publishDate 2024
record_format arxiv
title FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter
topic Sound
Audio and Speech Processing
url https://arxiv.org/abs/2406.08196
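
The pseudo-inverse initialization described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The triangular filterbank construction, the sizes (80 mel bands, FFT size 1024, 22050 Hz sample rate), the clamping floor, and all function names are assumptions chosen for illustration only.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other mel variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filterbank with shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def pseudo_inverse_amplitude(mel_spec, fb, floor=1e-5):
    """Rough amplitude-spectrum estimate from a mel spectrogram.

    mel_spec: (n_mels, frames), fb: (n_mels, n_bins).
    The pseudo-inverse can produce negative values, so the result is
    clamped to a small positive floor (hypothetical value).
    """
    inv = np.linalg.pinv(fb)      # (n_bins, n_mels)
    approx = inv @ mel_spec       # (n_bins, frames)
    return np.maximum(approx, floor)


# Toy usage on random data standing in for a real amplitude spectrogram.
rng = np.random.default_rng(0)
fb = mel_filterbank(n_mels=80, n_fft=1024, sample_rate=22050)
amplitude = np.abs(rng.standard_normal((513, 10)))
mel = fb @ amplitude
estimate = pseudo_inverse_amplitude(mel, fb)
print(estimate.shape)
```

In a vocoder, an estimate like this would only serve as the starting point that a learned amplitude branch then refines. Since the pseudo-inverse is a fixed matrix computed once from the filterbank, it adds no trainable parameters.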