Staff View: RenderBox: Expressive Performance Rendering with Text Control :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Huan; Maezawa, Akira; Dixon, Simon
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Multimedia
Online Access:	https://arxiv.org/abs/2502.07711

_version_	1866917918929846272
author	Zhang, Huan; Maezawa, Akira; Dixon, Simon
author_facet	Zhang, Huan; Maezawa, Akira; Dixon, Simon
contents	Expressive music performance rendering involves interpreting symbolic scores with variations in timing, dynamics, articulation, and instrument-specific techniques, resulting in performances that capture musical and emotional intent. We introduce RenderBox, a unified framework for text-and-score controlled audio performance generation across multiple instruments, applying coarse-level controls through natural language descriptions and granular-level controls using music scores. Based on a diffusion transformer architecture and cross-attention joint conditioning, we propose a curriculum-based paradigm that trains from plain synthesis to expressive performance, gradually incorporating controllable factors such as speed, mistakes, and style diversity. RenderBox achieves high performance compared to baseline models across key metrics such as FAD and CLAP, and also tempo and pitch accuracy under different prompting tasks. Subjective evaluation further demonstrates that RenderBox is able to generate controllable expressive performances that sound natural and musically engaging, aligning well with prompts and intent.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_07711
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	RenderBox: Expressive Performance Rendering with Text Control Zhang, Huan Maezawa, Akira Dixon, Simon Audio and Speech Processing Multimedia Expressive music performance rendering involves interpreting symbolic scores with variations in timing, dynamics, articulation, and instrument-specific techniques, resulting in performances that capture musical and emotional intent. We introduce RenderBox, a unified framework for text-and-score controlled audio performance generation across multiple instruments, applying coarse-level controls through natural language descriptions and granular-level controls using music scores. Based on a diffusion transformer architecture and cross-attention joint conditioning, we propose a curriculum-based paradigm that trains from plain synthesis to expressive performance, gradually incorporating controllable factors such as speed, mistakes, and style diversity. RenderBox achieves high performance compared to baseline models across key metrics such as FAD and CLAP, and also tempo and pitch accuracy under different prompting tasks. Subjective evaluation further demonstrates that RenderBox is able to generate controllable expressive performances that sound natural and musically engaging, aligning well with prompts and intent.
title	RenderBox: Expressive Performance Rendering with Text Control
topic	Audio and Speech Processing; Multimedia
url	https://arxiv.org/abs/2502.07711