Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Yihan; Lu, Yichen; Peng, Yifan; Wang, Xihua; Song, Ruihua; Watanabe, Shinji
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2412.19005

_version_	1866909441285160960
author	Wu, Yihan; Lu, Yichen; Peng, Yifan; Wang, Xihua; Song, Ruihua; Watanabe, Shinji
author_facet	Wu, Yihan; Lu, Yichen; Peng, Yifan; Wang, Xihua; Song, Ruihua; Watanabe, Shinji
contents	Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained real-world scenarios across various domains due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous works fine-tune audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, they often neglect visual features and common errors in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data via simulating common errors that occurred in AV-ASR from two focals: manipulating the audio or vision input and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method to improve AV-ASR models by leveraging both input-side and output-side preference. Extensive experiments demonstrate that our approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_19005
institution	arXiv
publishDate	2024
record_format	arxiv
title	Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization
topic	Audio and Speech Processing; Artificial Intelligence
url	https://arxiv.org/abs/2412.19005

Similar Items