Vista Equipo: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autores principales:	Chen, Yu, Zhu, Hongxu, Wang, Jiadong, Chen, Kainan, Qian, Xinyuan
Formato:	Preprint
Publicado:	2025
Materias:	Sound Audio and Speech Processing
Acceso en línea:	https://arxiv.org/abs/2507.07384
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

_version_	1866911093532655616
author	Chen, Yu Zhu, Hongxu Wang, Jiadong Chen, Kainan Qian, Xinyuan
author_facet	Chen, Yu Zhu, Hongxu Wang, Jiadong Chen, Kainan Qian, Xinyuan
contents	Audio-visual sound source localization (AV-SSL) estimates the position of sound sources by fusing auditory and visual cues. Current AV-SSL methodologies typically require spatially-paired audio-visual data and cannot selectively localize specific target sources. To address these limitations, we introduce Cross-Instance Audio-Visual Localization (CI-AVL), a novel task that localizes target sound sources using visual prompts from different instances of the same semantic class. CI-AVL enables selective localization without spatially paired data. To solve this task, we propose AV-SSAN, a semantic-spatial alignment framework centered on a Multi-Band Semantic-Spatial Alignment Network (MB-SSA Net). MB-SSA Net decomposes the audio spectrogram into multiple frequency bands, aligns each band with semantic visual prompts, and refines spatial cues to estimate the direction-of-arrival (DoA). To facilitate this research, we construct VGGSound-SSL, a large-scale dataset comprising 13,981 spatial audio clips across 296 categories, each paired with visual prompts. AV-SSAN achieves a mean absolute error of 16.59 and an accuracy of 71.29%, significantly outperforming existing AV-SSL methods. Code and data will be public.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_07384
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	AV-SSAN: Audio-Visual Selective DoA Estimation through Explicit Multi-Band Semantic-Spatial Alignment Chen, Yu Zhu, Hongxu Wang, Jiadong Chen, Kainan Qian, Xinyuan Sound Audio and Speech Processing Audio-visual sound source localization (AV-SSL) estimates the position of sound sources by fusing auditory and visual cues. Current AV-SSL methodologies typically require spatially-paired audio-visual data and cannot selectively localize specific target sources. To address these limitations, we introduce Cross-Instance Audio-Visual Localization (CI-AVL), a novel task that localizes target sound sources using visual prompts from different instances of the same semantic class. CI-AVL enables selective localization without spatially paired data. To solve this task, we propose AV-SSAN, a semantic-spatial alignment framework centered on a Multi-Band Semantic-Spatial Alignment Network (MB-SSA Net). MB-SSA Net decomposes the audio spectrogram into multiple frequency bands, aligns each band with semantic visual prompts, and refines spatial cues to estimate the direction-of-arrival (DoA). To facilitate this research, we construct VGGSound-SSL, a large-scale dataset comprising 13,981 spatial audio clips across 296 categories, each paired with visual prompts. AV-SSAN achieves a mean absolute error of 16.59 and an accuracy of 71.29%, significantly outperforming existing AV-SSL methods. Code and data will be public.
title	AV-SSAN: Audio-Visual Selective DoA Estimation through Explicit Multi-Band Semantic-Spatial Alignment
topic	Sound Audio and Speech Processing
url	https://arxiv.org/abs/2507.07384

Ejemplares similares