Salvato in:
| Autori principali: | , , , , |
|---|---|
| Natura: | Preprint |
| Pubblicazione: |
2026
|
| Soggetti: | |
| Accesso online: | https://arxiv.org/abs/2602.08293 |
| Tags: |
Aggiungi Tag
Nessun Tag, puoi essere il primo ad aggiungerne!!
|
| _version_ | 1866911433849044992 |
|---|---|
| author | Ok, Seaone Choi, Min Jun Kim, Eungbeom Han, Seungu Lee, Kyogu |
| author_facet | Ok, Seaone Choi, Min Jun Kim, Eungbeom Han, Seungu Lee, Kyogu |
| contents | Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual cues to improve speech recognition under noisy conditions. A central question is how to design a fusion mechanism that allows the model to effectively exploit visual information when the audio signal is degraded, while maintaining strong performance on clean speech. We propose CoBRA (Cross-modal Bottleneck for Robust AVSR), a bottleneck-based fusion framework that introduces a compact set of learnable tokens to mediate cross-modal exchange. By regulating information flow through these tokens, the audio stream can reliably access essential visual cues even under adverse or out-of-domain noise. Despite limited training data, our model surpasses comparable baselines and remains competitive with large-scale systems through noise-adaptive fusion, demonstrating both efficiency and robustness. Ablation studies highlight that the depth of fusion is the most critical factor, underscoring its importance in designing robust AVSR systems. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2602_08293 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition Ok, Seaone Choi, Min Jun Kim, Eungbeom Han, Seungu Lee, Kyogu Audio and Speech Processing Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual cues to improve speech recognition under noisy conditions. A central question is how to design a fusion mechanism that allows the model to effectively exploit visual information when the audio signal is degraded, while maintaining strong performance on clean speech. We propose CoBRA (Cross-modal Bottleneck for Robust AVSR), a bottleneck-based fusion framework that introduces a compact set of learnable tokens to mediate cross-modal exchange. By regulating information flow through these tokens, the audio stream can reliably access essential visual cues even under adverse or out-of-domain noise. Despite limited training data, our model surpasses comparable baselines and remains competitive with large-scale systems through noise-adaptive fusion, demonstrating both efficiency and robustness. Ablation studies highlight that the depth of fusion is the most critical factor, underscoring its importance in designing robust AVSR systems. |
| title | Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition |
| topic | Audio and Speech Processing |
| url | https://arxiv.org/abs/2602.08293 |