| Main Authors: | Pegg, Samuel, Li, Kai, Hu, Xiaolin |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2309.17189 |
Similar Items
TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion
by: Pegg, Samuel, et al.
Published: (2024)
IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
by: Li, Kai, et al.
Published: (2023)
Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling
by: Du, Jiarong, et al.
Published: (2025)
A Fast and Lightweight Model for Causal Audio-Visual Speech Separation
by: Sang, Wendi, et al.
Published: (2025)
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
by: Rouditchenko, Andrew, et al.
Published: (2024)
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
by: Rouditchenko, Andrew, et al.
Published: (2025)
Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes
by: Ryu, Hyeonggon, et al.
Published: (2025)
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2025)
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
by: Yeo, Jeong Hun, et al.
Published: (2025)
Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
by: Anand, et al.
Published: (2025)
Large Language Models are Strong Audio-Visual Speech Recognition Learners
by: Cappellazzo, Umberto, et al.
Published: (2024)
Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
by: Mu, Zhaoxi, et al.
Published: (2024)
Time-Frequency-Based Attention Cache Memory Model for Real-Time Speech Separation
by: Chen, Guo, et al.
Published: (2025)
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)
SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer
by: Park, Young-Hu, et al.
Published: (2025)
RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
by: Chen, Honglie, et al.
Published: (2024)
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
by: Liu, Chen, et al.
Published: (2025)
ZeroSep: Separate Anything in Audio with Zero Training
by: Huang, Chao, et al.
Published: (2025)
Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues
by: Chen, Tianxiang, et al.
Published: (2024)
Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator
by: Kang, Minjae, et al.
Published: (2025)
Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2026)
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
by: Cappellazzo, Umberto, et al.
Published: (2025)
CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
by: Chen, Yuanhong, et al.
Published: (2025)
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)
Continual Audio-Visual Sound Separation
by: Pian, Weiguo, et al.
Published: (2024)
Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio
by: Chen, Gongyu, et al.
Published: (2024)
Segment Beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation
by: Wu, Renjie, et al.
Published: (2023)
DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation
by: Tian, Jingqi, et al.
Published: (2025)
AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition
by: Xue, Junxiao, et al.
Published: (2025)
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
by: Wang, Hao, et al.
Published: (2024)
Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
by: Liu, Zehua, et al.
Published: (2025)
SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification
by: Rajasekhar, Gnana Praveen, et al.
Published: (2025)
DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation
by: Zhang, Haomin, et al.
Published: (2025)
Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation
by: Li, Hao, et al.
Published: (2025)
A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio
by: Juanola, Xavier, et al.
Published: (2024)
CNVSRC 2024: The Second Chinese Continuous Visual Speech Recognition Challenge
by: Liu, Zehua, et al.
Published: (2025)
MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers
by: Mahmud, Tanvir, et al.
Published: (2024)
Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities
by: Li, Yidi, et al.
Published: (2024)
Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling
by: Korbar, Bruno, et al.
Published: (2024)