| Main Author: | Chung, HaeChun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.22166 |
| _version_ | 1866908733760602112 |
|---|---|
| author | Chung, HaeChun |
| author_facet | Chung, HaeChun |
| contents | Text-to-audio (TTA) generation can significantly benefit the media industry by reducing production costs and enhancing work efficiency. However, most current TTA models (primarily diffusion-based) suffer from slow inference speeds and high computational costs. In this paper, we introduce AudioGAN, the first successful Generative Adversarial Networks (GANs)-based TTA framework that generates audio in a single pass, thereby reducing model complexity and inference time. To overcome the inherent difficulties in training GANs, we integrate multiple contrastive losses and propose innovative components, Single-Double-Triple (SDT) Attention and Time-Frequency Cross-Attention (TF-CA). Extensive experiments on the AudioCaps dataset demonstrate that AudioGAN achieves state-of-the-art performance while using 90% fewer parameters and running 20 times faster, synthesizing audio in under one second. These results establish AudioGAN as a practical and powerful solution for real-time TTA. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2512_22166 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | AudioGAN: A Compact and Efficient Framework for Real-Time High-Fidelity Text-to-Audio Generation |
| topic | Sound Audio and Speech Processing |
| url | https://arxiv.org/abs/2512.22166 |