Bibliographic Details
Main Authors:	Ok, Seaone; Choi, Min Jun; Kim, Eungbeom; Han, Seungu; Lee, Kyogu
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2602.08293

_version_	1866911433849044992
author	Ok, Seaone; Choi, Min Jun; Kim, Eungbeom; Han, Seungu; Lee, Kyogu
author_facet	Ok, Seaone; Choi, Min Jun; Kim, Eungbeom; Han, Seungu; Lee, Kyogu
contents	Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual cues to improve speech recognition under noisy conditions. A central question is how to design a fusion mechanism that allows the model to effectively exploit visual information when the audio signal is degraded, while maintaining strong performance on clean speech. We propose CoBRA (Cross-modal Bottleneck for Robust AVSR), a bottleneck-based fusion framework that introduces a compact set of learnable tokens to mediate cross-modal exchange. By regulating information flow through these tokens, the audio stream can reliably access essential visual cues even under adverse or out-of-domain noise. Despite limited training data, our model surpasses comparable baselines and remains competitive with large-scale systems through noise-adaptive fusion, demonstrating both efficiency and robustness. Ablation studies highlight that the depth of fusion is the most critical factor, underscoring its importance in designing robust AVSR systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_08293
institution	arXiv
publishDate	2026
record_format	arxiv
title	Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition
topic	Audio and Speech Processing
url	https://arxiv.org/abs/2602.08293
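
Illustrative note: the abstract in the contents field describes fusion in which a compact set of tokens mediates the exchange between the audio and visual streams. The sketch below shows one generic way such a bottleneck could work. It is not the CoBRA implementation: the function names (`cross_attention`, `bottleneck_fusion`), the two-step read order, the residual connections, the token counts, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: no learned projection matrices, to keep the sketch short.
    queries: (n_q, d), keys_values: (n_kv, d) -> output: (n_q, d)
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def bottleneck_fusion(audio, visual, bottleneck):
    """Hypothetical bottleneck fusion step (not the paper's exact method).

    1. The bottleneck tokens gather information from the visual tokens.
    2. The audio tokens read only from the updated bottleneck tokens,
       so visual information reaches audio through k tokens only.
    """
    summary = bottleneck + cross_attention(bottleneck, visual)
    fused_audio = audio + cross_attention(audio, summary)
    return fused_audio, summary


rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 16))      # 50 audio frames, feature size 16
visual = rng.normal(size=(25, 16))     # 25 video frames, feature size 16
bottleneck = rng.normal(size=(4, 16))  # 4 bottleneck tokens (learnable in practice)

fused_audio, summary = bottleneck_fusion(audio, visual, bottleneck)
print(fused_audio.shape, summary.shape)  # (50, 16) (4, 16)
```

In a trained model the bottleneck tokens would be parameters updated by gradient descent, and a step like this would typically be repeated over several layers; the abstract reports that the depth of fusion was the most critical factor in the ablations.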
