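Staff View: Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides :: Library Catalog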

Bibliographic Details
Main Authors:	Zhao, Jinghua; Jia, Yuhang; Wang, Shiyao; Zhou, Jiaming; Wang, Hui; Qin, Yong
Format:	Preprint
Published:	2025
Subjects:	Multimedia; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.15066

_version_	1866909586433245184
author	Zhao, Jinghua; Jia, Yuhang; Wang, Shiyao; Zhou, Jiaming; Wang, Hui; Qin, Yong
author_facet	Zhao, Jinghua; Jia, Yuhang; Wang, Shiyao; Zhou, Jiaming; Wang, Hui; Qin, Yong
contents	Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely on either lip-reading information or contextual video of the speaking scene alone, neglecting the potential of combining these complementary visual cues. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcriptions, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively, with a combined performance improvement of about 35%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_15066
institution	arXiv
publishDate	2025
record_format	arxiv
title	Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides
topic	Multimedia; Artificial Intelligence
url	https://arxiv.org/abs/2504.15066

Similar Items