| Main Authors: | Hao, Chunbo; Yuan, Ruibin; Yao, Jixun; Deng, Qixin; Bai, Xinyi; Wang, Yanbo; Xue, Wei; Xie, Lei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2510.02797 |
| _version_ | 1866913014426370048 |
|---|---|
| author | Hao, Chunbo; Yuan, Ruibin; Yao, Jixun; Deng, Qixin; Bai, Xinyi; Wang, Yanbo; Xue, Wei; Xie, Lei |
| author_facet | Hao, Chunbo; Yuan, Ruibin; Yao, Jixun; Deng, Qixin; Bai, Xinyi; Wang, Yanbo; Xue, Wei; Xie, Lei |
| contents | Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised learning representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 14k songs spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are open-sourced at https://github.com/ASLP-lab/SongFormer. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2510_02797 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision |
| topic | Audio and Speech Processing |
| url | https://arxiv.org/abs/2510.02797 |
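
The abstract reports boundary detection at a strict tolerance (HR.5F) and a relaxed one (HR3F). In the MSA literature these names conventionally denote the hit-rate F-measure with a 0.5 s and a 3 s matching window. The sketch below illustrates that convention only; it is not the authors' evaluation code. The function name `boundary_hit_rate_f` and the example boundary times are hypothetical, and details such as trimming the first and last boundaries are left out.

```python
def boundary_hit_rate_f(reference, estimated, tolerance=0.5):
    """Return (precision, recall, F-measure) for boundary times in seconds.

    Illustrative sketch: each reference boundary may be matched by at most
    one estimated boundary lying within `tolerance` seconds of it.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    if not ref or not est:
        return 0.0, 0.0, 0.0

    hits = 0
    i = j = 0
    # Two-pointer sweep over both sorted lists. For a fixed tolerance on a
    # line, this greedy one-to-one matching finds the maximum number of hits.
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # estimate is too early for this and all later references
        else:
            i += 1  # reference is too early for this and all later estimates

    precision = hits / len(est)
    recall = hits / len(ref)
    if hits == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical boundary annotations, in seconds.
reference_times = [0.0, 10.0, 20.0, 30.0]
estimated_times = [0.2, 10.8, 19.6, 30.1, 40.0]

strict = boundary_hit_rate_f(reference_times, estimated_times, tolerance=0.5)
relaxed = boundary_hit_rate_f(reference_times, estimated_times, tolerance=3.0)
print("0.5 s window:", strict)
print("3 s window:  ", relaxed)
```

In this toy example the estimate at 10.8 s misses under the 0.5 s window but counts as a hit under the 3 s window, which is why the relaxed score is higher. For real evaluations, an established implementation such as `mir_eval.segment.detection` (with its `window` parameter) is the safer choice.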