Bibliographic Details
Main Authors: Shimada, Kazuki; Politis, Archontis; Roman, Iran R.; Sudarsanam, Parthasaarathy; Diaz-Guerra, David; Pandey, Ruchi; Uchida, Kengo; Koyama, Yuichiro; Takahashi, Naoya; Shibuya, Takashi; Takahashi, Shusuke; Virtanen, Tuomas; Mitsufuji, Yuki
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2507.12042
author Shimada, Kazuki
Politis, Archontis
Roman, Iran R.
Sudarsanam, Parthasaarathy
Diaz-Guerra, David
Pandey, Ruchi
Uchida, Kengo
Koyama, Yuichiro
Takahashi, Naoya
Shibuya, Takashi
Takahashi, Shusuke
Virtanen, Tuomas
Mitsufuji, Yuki
contents This paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge on sound event localization and detection (SELD). In previous editions, the challenge used four-channel audio formats of first-order Ambisonics (FOA) and microphone array. In contrast, this year's challenge investigates SELD with stereo audio data (termed stereo SELD). This change shifts the focus from more specialized 360° audio and audiovisual scene analysis to more commonplace audio and media scenarios with limited field-of-view (FOV). Due to inherent angular ambiguities in stereo audio data, the task focuses on direction-of-arrival (DOA) estimation in the azimuth plane (left-right axis) along with distance estimation. The challenge remains divided into two tracks: audio-only and audiovisual, with the audiovisual track introducing a new sub-task of onscreen/offscreen event classification necessitated by the limited FOV. This challenge introduces the DCASE2025 Task3 Stereo SELD Dataset, whose stereo audio and perspective video clips are sampled and converted from the STARSS23 recordings. The baseline system is designed to process stereo audio and corresponding video frames as inputs. In addition to the typical SELD event classification and localization, it integrates onscreen/offscreen classification for the audiovisual track. The evaluation metrics have been modified to introduce an onscreen/offscreen accuracy metric, which assesses the models' ability to identify which sound sources are onscreen. In the experimental evaluation, the baseline system performs reasonably well with the stereo audio data.
format Preprint
id arxiv_https___arxiv_org_abs_2507_12042
institution arXiv
publishDate 2025
record_format arxiv
title Stereo Sound Event Localization and Detection with Onscreen/offscreen Classification
topic Sound
Computer Vision and Pattern Recognition
Multimedia
Audio and Speech Processing
Image and Video Processing
url https://arxiv.org/abs/2507.12042