Bibliographic details
Main authors: Nihal, Ragib Amin; Yen, Benjamin; Shi, Runwu; Ashizawa, Takeshi; Nakadai, Kazuhiro
Format: Preprint
Published: 2025
Subjects: Sound; Artificial Intelligence; Machine Learning; Audio and Speech Processing
Online access: https://arxiv.org/abs/2502.20838
_version_ 1866911725559742464
author Nihal, Ragib Amin
Yen, Benjamin
Shi, Runwu
Ashizawa, Takeshi
Nakadai, Kazuhiro
contents Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls requires two separate annotation efforts: binary presence labels for classification and precise temporal boundaries for localization. A binary label for a multi-minute recording can be assigned in seconds, but timestamping every call within it requires hours of expert effort. Providing both is infeasible at operational scale. We present DSMIL-LocNet, a weakly supervised multiple instance learning (MIL) framework that performs both classification and temporal localization using only recording-level presence/absence labels. Our dual-stream architecture integrates spectral and temporal features to process recordings of 2–30 minutes without the temporal compression that degrades existing CNN methods on long inputs. On the AcousticTrends BlueFinLibrary, DSMIL-LocNet achieves F1 scores of 0.88–0.91 on recordings of 300–1800 s, where fully supervised CNN baselines degrade to 0.19–0.64. It also provides temporal localization that these baselines cannot produce without frame-level annotation. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
format Preprint
id arxiv_https___arxiv_org_abs_2502_20838
institution arXiv
publishDate 2025
record_format arxiv
title Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data
topic Sound
Artificial Intelligence
Machine Learning
Audio and Speech Processing
url https://arxiv.org/abs/2502.20838
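
The abstract's central idea, that recording-level labels can still yield per-segment call locations, can be illustrated with a minimal multiple instance learning sketch. This is not the authors' DSMIL-LocNet: the record does not describe their pooling or attention mechanism, so plain max pooling stands in for it here. The function names, the 15-second segment length, the 0.5 threshold, and the example scores are all hypothetical values chosen for illustration, not results from the paper.

```python
from typing import List, Tuple


def bag_score(instance_scores: List[float]) -> float:
    """Max-pooling MIL (an assumed stand-in): a recording (bag) is scored
    by its most call-like segment (instance)."""
    return max(instance_scores)


def localize(
    instance_scores: List[float],
    segment_seconds: float,
    threshold: float = 0.5,
) -> List[Tuple[float, float]]:
    """Merge consecutive segments scoring >= threshold into
    (start, end) intervals, expressed in seconds."""
    intervals: List[Tuple[float, float]] = []
    start = None  # index of the first segment in the current run
    for i, score in enumerate(instance_scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            intervals.append((start * segment_seconds, i * segment_seconds))
            start = None
    if start is not None:
        # The run extends to the end of the recording.
        intervals.append(
            (start * segment_seconds, len(instance_scores) * segment_seconds)
        )
    return intervals


# Hypothetical per-segment scores for a 90-second recording
# split into six 15-second segments.
scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7]
recording_score = bag_score(scores)
call_intervals = localize(scores, segment_seconds=15.0)
```

In training, only `recording_score` would be compared against the recording-level label; the per-segment scores that fall out of the same forward pass are what make localization possible without timestamp annotations.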