Staff View: Learning Multi-Target TDOA Features for Sound Event Localization and Detection :: Library Catalog

Bibliographic Details
Main Authors:	Berg, Axel; Engman, Johanna; Gulin, Jens; Åström, Karl; Oskarsson, Magnus
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Machine Learning
Online Access:	https://arxiv.org/abs/2408.17166

_version_	1866916375580114944
author	Berg, Axel; Engman, Johanna; Gulin, Jens; Åström, Karl; Oskarsson, Magnus
author_facet	Berg, Axel; Engman, Johanna; Gulin, Jens; Åström, Karl; Oskarsson, Magnus
contents	Sound event localization and detection (SELD) systems using audio recordings from a microphone array rely on spatial cues for determining the location of sound events. As a consequence, the localization performance of such systems is to a large extent determined by the quality of the audio features that are used as inputs to the system. We propose a new feature, based on neural generalized cross-correlations with phase-transform (NGCC-PHAT), that learns audio representations suitable for localization. Using permutation invariant training for the time-difference of arrival (TDOA) estimation problem enables NGCC-PHAT to learn TDOA features for multiple overlapping sound events. These features can be used as a drop-in replacement for GCC-PHAT inputs to a SELD-network. We test our method on the STARSS23 dataset and demonstrate improved localization performance compared to using standard GCC-PHAT or SALSA-Lite input features.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_17166
institution	arXiv
publishDate	2024
record_format	arxiv
title	Learning Multi-Target TDOA Features for Sound Event Localization and Detection
topic	Audio and Speech Processing; Machine Learning
url	https://arxiv.org/abs/2408.17166