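Staff View: S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation :: Library Catalog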

Bibliographic Details
Main Authors:	Chen, Huakang; Cheng, Wenkai; Ma, Guobin; Hao, Chunbo; Xia, Yuxuan; Wei, Mengqi; Zhao, Zhixian; Zhu, Pengcheng; Zhang, Hanbing; Xie, Lei
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2605.17414

_version_	1866910229109669888
author	Chen, Huakang; Cheng, Wenkai; Ma, Guobin; Hao, Chunbo; Xia, Yuxuan; Wei, Mengqi; Zhao, Zhixian; Zhu, Pengcheng; Zhang, Hanbing; Xie, Lei
author_facet	Chen, Huakang; Cheng, Wenkai; Ma, Guobin; Hao, Chunbo; Xia, Yuxuan; Wei, Mengqi; Zhao, Zhixian; Zhu, Pengcheng; Zhang, Hanbing; Xie, Lei
contents	High-fidelity text-to-music generation typically relies on massive proprietary datasets and immense computational resources. Existing models often struggle to generate coherent pure musical accompaniments and lack precise, localized semantic control due to their reliance on coarse, track-level annotations. To address these limitations under constrained data and computing resources, we propose S2Accompanist, a Semantic-Aware and Structure-Guided Diffusion Model developed for the ICME2026 ATTM Grand Challenge. Specifically, we design an automated data pipeline comprising structural segmentation, Large Audio-Language Model driven segment-level captioning, and dual-metric quality grading to overcome the absence of localized metadata in raw datasets. Furthermore, we propose a semantic-aware Variational Autoencoder fine-tuning strategy that explicitly distills foundational LeadSheet structures into the acoustic latent space, effectively improving the overall audio fidelity. Extensive experiments demonstrate that S2Accompanist achieves state-of-the-art objective performance on the ATTM Grand Challenge benchmark across both the Efficiency and Performance Tracks. With only 402M parameters, our model remains competitive compared to larger-scale unconstrained models and secured first place in the Efficiency Track.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_17414
institution	arXiv
publishDate	2026
record_format	arxiv
title	S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2605.17414
