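Staff View: Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation :: Library Catalog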

Bibliographic Details
Main Authors:	Gu, Yi; Liu, Yanqing; Yang, Chen; Zhao, Sheng
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing; Sound
Online Access:	https://arxiv.org/abs/2603.01565

_version_	1866914363106918400
author	Gu, Yi; Liu, Yanqing; Yang, Chen; Zhao, Sheng
author_facet	Gu, Yi; Liu, Yanqing; Yang, Chen; Zhao, Sheng
contents	Text-to-audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and achieving precise text-audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis quality remains constrained. In this work, we experiment with reinforcement learning to further enhance T2A generation quality, building on diffusion transformer (DiT)-based architectures. Our method first employs a large language model (LLM) to generate high-fidelity, richly detailed audio captions, substantially improving text-audio semantic alignment, especially for ambiguous or underspecified prompts. We then apply Group Relative Policy Optimization (GRPO), a recently introduced reinforcement learning algorithm, to fine-tune the T2A model. Through systematic experimentation with diverse reward functions (including CLAP, KL, FAD, and their combinations), we identify the key drivers of effective RL in audio synthesis and analyze how reward design impacts final audio quality. Experimental results demonstrate that GRPO-based fine-tuning yields substantial gains in synthesis fidelity and prompt adherence.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_01565
institution	arXiv
publishDate	2026
record_format	arxiv
title	Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation
topic	Audio and Speech Processing; Sound
url	https://arxiv.org/abs/2603.01565
