Staff View: Post-Training Quantization for Audio Diffusion Transformers :: Library Catalog

Bibliographic Details
Main Authors:	Khandelwal, Tanmay; Fuentes, Magdalena
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Sound
Online Access:	https://arxiv.org/abs/2510.00313

_version_	1866915527606140928
author	Khandelwal, Tanmay; Fuentes, Magdalena
author_facet	Khandelwal, Tanmay; Fuentes, Magdalena
contents	Diffusion Transformers (DiTs) enable high-quality audio synthesis but are often computationally intensive and require substantial storage, which limits their practical deployment. In this paper, we present a comprehensive evaluation of post-training quantization (PTQ) techniques for audio DiTs, analyzing the trade-offs between static and dynamic quantization schemes. We explore two practical extensions: (1) a denoising-timestep-aware smoothing method that adapts quantization scales per input channel and per timestep to mitigate activation outliers, and (2) a lightweight low-rank adapter (LoRA)-based branch derived from singular value decomposition (SVD) to compensate for residual weight errors. Using Stable Audio Open, we benchmark W8A8 and W4A8 configurations across objective metrics and human perceptual ratings. Our results show that dynamic quantization preserves fidelity even at lower precision, while static methods remain competitive with lower latency. Overall, our findings show that low-precision DiTs can retain high-fidelity generation while reducing memory usage by up to 79%.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_00313
institution	arXiv
publishDate	2025
record_format	arxiv
title	Post-Training Quantization for Audio Diffusion Transformers
topic	Audio and Speech Processing; Sound
url	https://arxiv.org/abs/2510.00313

Similar Items