Bibliographic Details
Main Authors: Zou, Liang, Yan, Genwei, Wang, Ruoyu, Du, Jun, Lei, Meng, Gao, Tian, Fang, Xin
Format: Preprint
Published: 2024
Subjects: Sound; Computer Vision and Pattern Recognition; Audio and Speech Processing
Online Access: https://arxiv.org/abs/2403.11091
_version_ 1866929281968373760
author Zou, Liang
Yan, Genwei
Wang, Ruoyu
Du, Jun
Lei, Meng
Gao, Tian
Fang, Xin
author_facet Zou, Liang
Yan, Genwei
Wang, Ruoyu
Du, Jun
Lei, Meng
Gao, Tian
Fang, Xin
contents This paper focuses on few-shot Sound Event Detection (SED), which aims to automatically recognize and classify sound events from a limited number of samples. Prevailing methods in few-shot SED rely predominantly on segment-level predictions, which often struggle to provide detailed, fine-grained predictions, particularly for events of brief duration. Although frame-level prediction strategies have been proposed to overcome these limitations, they commonly suffer from prediction truncation caused by background noise. To alleviate this issue, we introduce a multitask frame-level SED framework. In addition, we introduce TimeFilterAug, a linear timing mask for data augmentation, to increase the model's robustness and adaptability to diverse acoustic environments. The proposed method achieves an F-score of 63.8%, ranking first in the few-shot bioacoustic event detection category of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2023.
format Preprint
id arxiv_https___arxiv_org_abs_2403_11091
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Multitask frame-level learning for few-shot sound event detection
Zou, Liang
Yan, Genwei
Wang, Ruoyu
Du, Jun
Lei, Meng
Gao, Tian
Fang, Xin
Sound
Computer Vision and Pattern Recognition
Audio and Speech Processing
title Multitask frame-level learning for few-shot sound event detection
topic Sound
Computer Vision and Pattern Recognition
Audio and Speech Processing
url https://arxiv.org/abs/2403.11091