Bibliographic Details
Main Authors: Gu, Yi; Liu, Yanqing; Yang, Chen; Zhao, Sheng
Format: Preprint
Published: 2026
Subjects: Audio and Speech Processing; Sound
Online Access: https://arxiv.org/abs/2603.01565
_version_ 1866914363106918400
author Gu, Yi
Liu, Yanqing
Yang, Chen
Zhao, Sheng
contents Text-to-audio (T2A) generation has advanced considerably in recent years, yet existing methods continue to face challenges in accurately rendering complex text prompts, particularly those involving intricate audio effects, and in achieving precise text-audio alignment. While prior approaches have explored data augmentation, explicit timing conditioning, and reinforcement learning, overall synthesis quality remains constrained. In this work, we experiment with reinforcement learning to further enhance T2A generation quality, building on diffusion transformer (DiT)-based architectures. Our method first employs a large language model (LLM) to generate high-fidelity, richly detailed audio captions, substantially improving text-audio semantic alignment, especially for ambiguous or underspecified prompts. We then apply Group Relative Policy Optimization (GRPO), a recently introduced reinforcement learning algorithm, to fine-tune the T2A model. Through systematic experimentation with diverse reward functions (including CLAP, KL, FAD, and their combinations), we identify the key drivers of effective RL in audio synthesis and analyze how reward design impacts final audio quality. Experimental results demonstrate that GRPO-based fine-tuning yields substantial gains in synthesis fidelity and prompt adherence.
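The abstract names GRPO without describing it. As a minimal sketch of the group-relative idea only, and not the authors' implementation, the snippet below normalizes the rewards of several clips generated from the same prompt against that group's mean and standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for illustration; the record does not give the paper's actual reward weighting or training loop.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is the reward's distance from the group mean, scaled by
    the group's standard deviation. `eps` (an assumed constant) avoids
    division by zero when every sample in the group has the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores (for example, a CLAP-style text-audio
# similarity) for four clips generated from a single prompt.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
```

Clips scoring above the group mean receive a positive advantage and those below receive a negative one, so the policy update depends on relative rather than absolute reward within each group.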
format Preprint
id arxiv_https___arxiv_org_abs_2603_01565
institution arXiv
publishDate 2026
record_format arxiv
title Investigating Group Relative Policy Optimization for Diffusion Transformer based Text-to-Audio Generation
topic Audio and Speech Processing
Sound
url https://arxiv.org/abs/2603.01565