Bibliographic Details
Main Authors: Jin, Derong; Gao, Ruohan
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition; Sound
Online Access: https://arxiv.org/abs/2504.21847
_version_ 1866915447861936128
author Jin, Derong
Gao, Ruohan
author_facet Jin, Derong
Gao, Ruohan
contents An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding learning-based models or computationally expensive physics-based modeling. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate, significantly outperforming a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves comparable performance to models trained on 10 times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.
format Preprint
id arxiv_https___arxiv_org_abs_2504_21847
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Differentiable Room Acoustic Rendering with Multi-View Vision Priors
Jin, Derong
Gao, Ruohan
Computer Vision and Pattern Recognition
Sound
title Differentiable Room Acoustic Rendering with Multi-View Vision Priors
topic Computer Vision and Pattern Recognition
Sound
url https://arxiv.org/abs/2504.21847