Bibliographic Details
Main Authors: Ok, Seaone, Choi, Min Jun, Kim, Eungbeom, Han, Seungu, Lee, Kyogu
Format: Preprint
Published: 2026
Subjects: Audio and Speech Processing
Online Access: https://arxiv.org/abs/2602.08293
_version_ 1866911433849044992
author Ok, Seaone
Choi, Min Jun
Kim, Eungbeom
Han, Seungu
Lee, Kyogu
contents Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual cues to improve speech recognition under noisy conditions. A central question is how to design a fusion mechanism that allows the model to effectively exploit visual information when the audio signal is degraded, while maintaining strong performance on clean speech. We propose CoBRA (Cross-modal Bottleneck for Robust AVSR), a bottleneck-based fusion framework that introduces a compact set of learnable tokens to mediate cross-modal exchange. By regulating information flow through these tokens, the audio stream can reliably access essential visual cues even under adverse or out-of-domain noise. Despite limited training data, our model surpasses comparable baselines and remains competitive with large-scale systems through noise-adaptive fusion, demonstrating both efficiency and robustness. Ablation studies highlight that the depth of fusion is the most critical factor, underscoring its importance in designing robust AVSR systems.
format Preprint
id arxiv_https___arxiv_org_abs_2602_08293
institution arXiv
publishDate 2026
record_format arxiv
title Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition
topic Audio and Speech Processing
url https://arxiv.org/abs/2602.08293