Staff View: SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision :: Library Catalog

Bibliographic Details
Main Authors:	Hao, Chunbo; Yuan, Ruibin; Yao, Jixun; Deng, Qixin; Bai, Xinyi; Wang, Yanbo; Xue, Wei; Xie, Lei
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2510.02797

_version_	1866913014426370048
author	Hao, Chunbo; Yuan, Ruibin; Yao, Jixun; Deng, Qixin; Bai, Xinyi; Wang, Yanbo; Xue, Wei; Xie, Lei
author_facet	Hao, Chunbo; Yuan, Ruibin; Yao, Jixun; Deng, Qixin; Bai, Xinyi; Wang, Yanbo; Xue, Wei; Xie, Lei
contents	Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised learning representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 14k songs spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are open-sourced at https://github.com/ASLP-lab/SongFormer.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_02797
institution	arXiv
publishDate	2025
record_format	arxiv
title	SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2510.02797