Staff View: Stereo Sound Event Localization and Detection with Onscreen/offscreen Classification :: Library Catalog


Bibliographic Details
Main Authors:	Shimada, Kazuki; Politis, Archontis; Roman, Iran R.; Sudarsanam, Parthasaarathy; Diaz-Guerra, David; Pandey, Ruchi; Uchida, Kengo; Koyama, Yuichiro; Takahashi, Naoya; Shibuya, Takashi; Takahashi, Shusuke; Virtanen, Tuomas; Mitsufuji, Yuki
Format:	Preprint
Published:	2025
Subjects:	Sound; Computer Vision and Pattern Recognition; Multimedia; Audio and Speech Processing; Image and Video Processing
Online Access:	https://arxiv.org/abs/2507.12042

_version_	1866916846128594944
author	Shimada, Kazuki; Politis, Archontis; Roman, Iran R.; Sudarsanam, Parthasaarathy; Diaz-Guerra, David; Pandey, Ruchi; Uchida, Kengo; Koyama, Yuichiro; Takahashi, Naoya; Shibuya, Takashi; Takahashi, Shusuke; Virtanen, Tuomas; Mitsufuji, Yuki
author_facet	Shimada, Kazuki; Politis, Archontis; Roman, Iran R.; Sudarsanam, Parthasaarathy; Diaz-Guerra, David; Pandey, Ruchi; Uchida, Kengo; Koyama, Yuichiro; Takahashi, Naoya; Shibuya, Takashi; Takahashi, Shusuke; Virtanen, Tuomas; Mitsufuji, Yuki
contents	This paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge on sound event localization and detection (SELD). In previous editions, the challenge used four-channel audio formats of first-order Ambisonics (FOA) and microphone array. In contrast, this year's challenge investigates SELD with stereo audio data (termed stereo SELD). This change shifts the focus from more specialized 360° audio and audiovisual scene analysis to more commonplace audio and media scenarios with limited field-of-view (FOV). Due to inherent angular ambiguities in stereo audio data, the task focuses on direction-of-arrival (DOA) estimation in the azimuth plane (left-right axis) along with distance estimation. The challenge remains divided into two tracks: audio-only and audiovisual, with the audiovisual track introducing a new sub-task of onscreen/offscreen event classification necessitated by the limited FOV. This challenge introduces the DCASE2025 Task3 Stereo SELD Dataset, whose stereo audio and perspective video clips are sampled and converted from the STARSS23 recordings. The baseline system is designed to process stereo audio and corresponding video frames as inputs. In addition to the typical SELD event classification and localization, it integrates onscreen/offscreen classification for the audiovisual track. The evaluation metrics have been modified to introduce an onscreen/offscreen accuracy metric, which assesses the models' ability to identify which sound sources are onscreen. In the experimental evaluation, the baseline system performs reasonably well with the stereo audio data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_12042
institution	arXiv
publishDate	2025
record_format	arxiv
title	Stereo Sound Event Localization and Detection with Onscreen/offscreen Classification
topic	Sound; Computer Vision and Pattern Recognition; Multimedia; Audio and Speech Processing; Image and Video Processing
url	https://arxiv.org/abs/2507.12042
