Library Catalog

Bibliographic Details
Main Authors:	Laux, Hendrik; Schmeink, Anke
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2409.07210

Similar Items

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
by: Burchi, Maxime, et al.
Published: (2024)

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
by: Rouditchenko, Andrew, et al.
Published: (2024)

MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2025)

CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge
by: Liu, Zehua, et al.
Published: (2025)

mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
by: Rouditchenko, Andrew, et al.
Published: (2025)

Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
by: Anand, et al.
Published: (2025)

Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
by: Cappellazzo, Umberto, et al.
Published: (2025)

AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)

VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models
by: Hu, Rui, et al.
Published: (2025)

Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
by: Liu, Zehua, et al.
Published: (2025)

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2026)

Large Language Models are Strong Audio-Visual Speech Recognition Learners
by: Cappellazzo, Umberto, et al.
Published: (2024)

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)

MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
by: Yeo, Jeong Hun, et al.
Published: (2025)

UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
by: Wang, Jinting, et al.
Published: (2025)

SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
by: Wang, Hao, et al.
Published: (2024)

AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition
by: Xue, Junxiao, et al.
Published: (2025)

Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition
by: Liu, Lei, et al.
Published: (2024)

Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
by: Cappellazzo, Umberto, et al.
Published: (2025)

Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes
by: Ryu, Hyeonggon, et al.
Published: (2025)

SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer
by: Park, Young-Hu, et al.
Published: (2025)

RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
by: Pegg, Samuel, et al.
Published: (2023)

AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines
by: Li, Cancan, et al.
Published: (2025)

Emotional Vietnamese Speech-Based Depression Diagnosis Using Dynamic Attention Mechanism
by: D., Quang-Anh N., et al.
Published: (2024)

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
by: Wu, Yihan, et al.
Published: (2024)

Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
by: Liu, Chen, et al.
Published: (2025)

Emotional Face-to-Speech
by: Ye, Jiaxin, et al.
Published: (2025)

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
by: Choi, Jeongsoo, et al.
Published: (2024)

United we stand, Divided we fall: Handling Weak Complementary Relationships for Audio-Visual Emotion Recognition in Valence-Arousal Space
by: Praveen, R. Gnana, et al.
Published: (2025)

JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition
by: Sun, Chang, et al.
Published: (2024)

Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework
by: Di Pierno, Andrea, et al.
Published: (2025)

IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
by: Li, Kai, et al.
Published: (2023)

Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
by: Du, Jiarong, et al.
Published: (2025)

From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization
by: Wahida, Farah, et al.
Published: (2025)

Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities
by: Li, Yidi, et al.
Published: (2024)

Input Conditioned Layer Dropping in Speech Foundation Models
by: Hannan, Abdul, et al.
Published: (2025)

RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
by: Chen, Honglie, et al.
Published: (2024)

Spiking Structured State Space Model for Monaural Speech Enhancement
by: Du, Yu, et al.
Published: (2023)

DiffSSD: A Diffusion-Based Dataset For Speech Forensics
by: Bhagtani, Kratika, et al.
Published: (2024)