Staff View: BridgeVoC: Revitalizing Neural Vocoder from a Restoration Perspective :: Library Catalog

Bibliographic Details
Main Authors:	Li, Andong; Lei, Tong; Chen, Rilin; Li, Kai; Yu, Meng; Li, Xiaodong; Yu, Dong; Zheng, Chengshi
Format:	Preprint
Published:	2025
Subjects:	Sound
Online Access:	https://arxiv.org/abs/2511.07116

_version_	1866917071013543936
author	Li, Andong; Lei, Tong; Chen, Rilin; Li, Kai; Yu, Meng; Li, Xiaodong; Yu, Dong; Zheng, Chengshi
author_facet	Li, Andong; Lei, Tong; Chen, Rilin; Li, Kai; Yu, Meng; Li, Xiaodong; Yu, Dong; Zheng, Chengshi
contents	This paper revisits the neural vocoder task through the lens of audio restoration and proposes a novel diffusion vocoder called BridgeVoC. Specifically, by rank analysis, we compare the rank characteristics of the Mel-spectrum with other common acoustic degradation factors, and cast the vocoder task as a specialized case of audio restoration, where the range-space spectral (RSS) surrogate of the target spectrum acts as the degraded input. Based on that, we introduce the Schrödinger bridge framework for diffusion modeling, which defines the RSS and target spectrum as dual endpoints of the stochastic generation trajectory. Further, to fully utilize the hierarchical prior of subbands in the time-frequency (T-F) domain, we devise a novel subband-aware convolutional diffusion network as the data predictor, where subbands are divided following an uneven strategy, and a convolutional-style attention module is employed with large kernels for efficient T-F contextual modeling. To enable single-step inference, we propose an omnidirectional distillation loss to facilitate effective information transfer from the teacher model to the student model, and the performance is improved by combining target-related and bijective consistency losses. Comprehensive experiments are conducted on various benchmarks and out-of-distribution datasets. Quantitative and qualitative results show that, while enjoying fewer parameters, lower computational cost, and competitive inference speed, the proposed BridgeVoC yields state-of-the-art performance over existing advanced GAN-, DDPM-, and flow-matching-based baselines with only 4 sampling steps. Consistent superiority is still achieved with single-step inference.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_07116
institution	arXiv
publishDate	2025
record_format	arxiv
title	BridgeVoC: Revitalizing Neural Vocoder from a Restoration Perspective
topic	Sound
url	https://arxiv.org/abs/2511.07116

Similar Items