| Main Authors: | Laux, Hendrik, Schmeink, Anke |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2409.07210 |
Similar Items
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
by: Burchi, Maxime, et al.
Published: (2024)
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
by: Rouditchenko, Andrew, et al.
Published: (2024)
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2025)
CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge
by: Liu, Zehua, et al.
Published: (2025)
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
by: Rouditchenko, Andrew, et al.
Published: (2025)
Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
by: Anand, et al.
Published: (2025)
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
by: Cappellazzo, Umberto, et al.
Published: (2025)
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)
VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models
by: Hu, Rui, et al.
Published: (2025)
Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
by: Liu, Zehua, et al.
Published: (2025)
Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2026)
Large Language Models are Strong Audio-Visual Speech Recognition Learners
by: Cappellazzo, Umberto, et al.
Published: (2024)
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
by: Yeo, Jeong Hun, et al.
Published: (2025)
UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
by: Wang, Jinting, et al.
Published: (2025)
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
by: Wang, Hao, et al.
Published: (2024)
AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition
by: Xue, Junxiao, et al.
Published: (2025)
Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition
by: Liu, Lei, et al.
Published: (2024)
Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
by: Cappellazzo, Umberto, et al.
Published: (2025)
Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes
by: Ryu, Hyeonggon, et al.
Published: (2025)
SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer
by: Park, Young-Hu, et al.
Published: (2025)
RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
by: Pegg, Samuel, et al.
Published: (2023)
AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines
by: Li, Cancan, et al.
Published: (2025)
Emotional Vietnamese Speech-Based Depression Diagnosis Using Dynamic Attention Mechanism
by: D., Quang-Anh N., et al.
Published: (2024)
Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
by: Wu, Yihan, et al.
Published: (2024)
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
by: Liu, Chen, et al.
Published: (2025)
Emotional Face-to-Speech
by: Ye, Jiaxin, et al.
Published: (2025)
V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
by: Choi, Jeongsoo, et al.
Published: (2024)
United we stand, Divided we fall: Handling Weak Complementary Relationships for Audio-Visual Emotion Recognition in Valence-Arousal Space
by: Praveen, R. Gnana, et al.
Published: (2025)
JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition
by: Sun, Chang, et al.
Published: (2024)
Towards Reliable Audio Deepfake Attribution and Model Recognition: A Multi-Level Autoencoder-Based Framework
by: Di Pierno, Andrea, et al.
Published: (2025)
IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
by: Li, Kai, et al.
Published: (2023)
Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
by: Du, Jiarong, et al.
Published: (2025)
From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization
by: Wahida, Farah, et al.
Published: (2025)
Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities
by: Li, Yidi, et al.
Published: (2024)
Input Conditioned Layer Dropping in Speech Foundation Models
by: Hannan, Abdul, et al.
Published: (2025)
RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
by: Chen, Honglie, et al.
Published: (2024)
Spiking Structured State Space Model for Monaural Speech Enhancement
by: Du, Yu, et al.
Published: (2023)
DiffSSD: A Diffusion-Based Dataset For Speech Forensics
by: Bhagtani, Kratika, et al.
Published: (2024)