Bibliographic Details
Main Authors: Hao, Chunbo, Yuan, Ruibin, Yao, Jixun, Deng, Qixin, Bai, Xinyi, Wang, Yanbo, Xue, Wei, Xie, Lei
Format: Preprint
Published: 2025
Subjects: Audio and Speech Processing
Online Access: https://arxiv.org/abs/2510.02797
author Hao, Chunbo
Yuan, Ruibin
Yao, Jixun
Deng, Qixin
Bai, Xinyi
Wang, Yanbo
Xue, Wei
Xie, Lei
contents Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised learning representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 14k songs spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are open-sourced at https://github.com/ASLP-lab/SongFormer.
format Preprint
id arxiv_https___arxiv_org_abs_2510_02797
institution arXiv
publishDate 2025
record_format arxiv
title SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
topic Audio and Speech Processing
url https://arxiv.org/abs/2510.02797