| Main Authors: | Yeung, Michael; Toyama, Keisuke; Teramoto, Toya; Takahashi, Shusuke; Kojima, Tamaki |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Sound; Machine Learning; Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2509.21739 |
| _version_ | 1866914369021935616 |
|---|---|
| author | Yeung, Michael; Toyama, Keisuke; Teramoto, Toya; Takahashi, Shusuke; Kojima, Tamaki |
| contents | Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, jointly generating binary onset values and continuous velocity values is challenging for diffusion models; to overcome this, we introduce an Annealed Pseudo-Huber loss that facilitates effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that including MFM features significantly improves robustness, and that N2N establishes new state-of-the-art performance across multiple ADT benchmarks. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2509_21739 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription |
| topic | Sound; Machine Learning; Audio and Speech Processing |
| url | https://arxiv.org/abs/2509.21739 |