Bibliographic details
Main authors: Xiong, Chenxu, Fu, Ruibo, Shi, Shuchen, Wen, Zhengqi, Tao, Jianhua, Wang, Tao, Li, Chenxing, Qiang, Chunyu, Xie, Yuankun, Qi, Xin, Li, Guanjun, Yang, Zizheng
Format: Preprint
Published: 2024
Subjects: Audio and Speech Processing; Artificial Intelligence; Sound
Online access: https://arxiv.org/abs/2409.09381
_version_ 1866929500183330816
author Xiong, Chenxu
Fu, Ruibo
Shi, Shuchen
Wen, Zhengqi
Tao, Jianhua
Wang, Tao
Li, Chenxing
Qiang, Chunyu
Xie, Yuankun
Qi, Xin
Li, Guanjun
Yang, Zizheng
contents Current mainstream audio generation methods primarily rely on simple text prompts, often failing to capture the nuanced details necessary for multi-style audio generation. To address this limitation, the Sound Event Enhanced Prompt Adapter is proposed. Unlike traditional static global style transfer, this method extracts style embedding through cross-attention between text and reference audio for adaptive style control. Adaptive layer normalization is then utilized to enhance the model's capacity to express multiple styles. Additionally, the Sound Event Reference Style Transfer Dataset (SERST) is introduced for the proposed target style audio generation task, enabling dual-prompt audio generation using both text and audio references. Experimental results demonstrate the robustness of the model, achieving state-of-the-art Fréchet Distance of 26.94 and KL Divergence of 1.82, surpassing Tango, AudioLDM, and AudioGen. Furthermore, the generated audio shows high similarity to its corresponding audio reference. The demo, code, and dataset are publicly available.
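The two mechanisms named in the abstract, a style embedding obtained by cross-attention between text and reference audio, followed by adaptive layer normalization, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tensor sizes, the mean pooling over text positions, and the `1 + scale` modulation form are all assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(text_tokens, audio_frames, w_q, w_k, w_v):
    """Text tokens (queries) attend over reference-audio frames (keys/values)."""
    q = matmul(text_tokens, w_q)    # (n_text, d)
    k = matmul(audio_frames, w_k)   # (n_audio, d)
    v = matmul(audio_frames, w_v)   # (n_audio, d)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)         # (n_text, n_audio)
    attn = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(attn, v)      # (n_text, d)
    # Assumption: pool over text positions to get a single style vector.
    return [sum(col) / len(attended) for col in zip(*attended)]


def adaptive_layer_norm(hidden, style, w_scale, w_shift, eps=1e-5):
    """Layer norm whose scale and shift are predicted from the style vector."""
    gamma = matmul([style], w_scale)[0]
    beta = matmul([style], w_shift)[0]
    out = []
    for row in hidden:
        mu = sum(row) / len(row)
        var = sum((x - mu) ** 2 for x in row) / len(row)
        norm = [(x - mu) / math.sqrt(var + eps) for x in row]
        out.append([n * (1.0 + g) + b for n, g, b in zip(norm, gamma, beta)])
    return out


def rand_matrix(rng, rows, cols):
    """Random Gaussian matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
text_tokens = rand_matrix(rng, 3, 8)     # 3 text tokens, 8-dim (hypothetical sizes)
audio_frames = rand_matrix(rng, 10, 6)   # 10 reference-audio frames, 6-dim
style = style_embedding(
    text_tokens, audio_frames,
    rand_matrix(rng, 8, 4), rand_matrix(rng, 6, 4), rand_matrix(rng, 6, 4),
)
hidden = rand_matrix(rng, 7, 5)          # 7 hidden positions, 5-dim
zero = [[0.0] * 5 for _ in range(4)]     # zero modulation: plain layer norm
out = adaptive_layer_norm(hidden, style, zero, zero)
```

With the modulation weights set to zero the layer reduces to ordinary layer normalization; non-zero weights let the reference audio rescale and shift each hidden channel, which is the sense in which the style control is adaptive rather than a fixed global transfer.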
format Preprint
id arxiv_https___arxiv_org_abs_2409_09381
institution arXiv
publishDate 2024
record_format arxiv
title Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation
topic Audio and Speech Processing
Artificial Intelligence
Sound
url https://arxiv.org/abs/2409.09381