:: Library Catalog

Bibliographic Details
Main Authors:	Fernandez-Lopez, Adriana; Chen, Honglie; Ma, Pingchuan; Yin, Lu; Xiao, Qiao; Petridis, Stavros; Liu, Shiwei; Pantic, Maja
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Multimedia
Online Access:	https://arxiv.org/abs/2406.17614

Similar Items

Large Language Models are Strong Audio-Visual Speech Recognition Learners
by: Cappellazzo, Umberto, et al.
Published: (2024)

RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
by: Chen, Honglie, et al.
Published: (2024)

Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models
by: Fernandez-Lopez, Adriana, et al.
Published: (2024)

Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
by: Kim, Minsu, et al.
Published: (2025)

MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2025)

Dynamic Data Pruning for Automatic Speech Recognition
by: Xiao, Qiao, et al.
Published: (2024)

Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
by: Cappellazzo, Umberto, et al.
Published: (2025)

Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
by: Cappellazzo, Umberto, et al.
Published: (2025)

Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction
by: Kim, Minsu, et al.
Published: (2025)

Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
by: Haliassos, Alexandros, et al.
Published: (2024)

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2026)

Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
by: Anand, et al.
Published: (2025)

Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)

Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
by: Hong, Joanna, et al.
Published: (2025)

Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach
by: Cappellazzo, Umberto, et al.
Published: (2025)

BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition
by: Haliassos, Alexandros, et al.
Published: (2024)

Angle-Optimized Partial Disentanglement for Multimodal Emotion Recognition in Conversation
by: Che, Xinyi, et al.
Published: (2025)

MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition
by: Pan, Yu, et al.
Published: (2023)

Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition
by: Wu, Linzhi, et al.
Published: (2026)

FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing
by: Cai, Lingling, et al.
Published: (2024)

Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation
by: Wang, Yongqi, et al.
Published: (2025)

State-Anchored Complete-View Distillation for Robust Conversational Multimodal Emotion Recognition
by: Pan, Zhaoyan, et al.
Published: (2026)

SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
by: Huynh, Ngoc Dung, et al.
Published: (2025)

Multimodal Emotion Recognition with Large Language Models
by: Zhang, Hongrui, et al.
Published: (2026)

SpikEmo: Enhancing Emotion Recognition With Spiking Temporal Dynamics in Conversations
by: Yu, Xiaomin, et al.
Published: (2024)

MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
by: Guan, Yiwen, et al.
Published: (2025)

TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition
by: Wang, Xinyu, et al.
Published: (2026)

Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)

Multimodal Learned Sparse Retrieval for Image Suggestion
by: Nguyen, Thong, et al.
Published: (2024)

Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
by: Zhou, Qianrui, et al.
Published: (2026)

Orthogonal Disentanglement with Projected Feature Alignment for Multimodal Emotion Recognition in Conversation
by: Che, Xinyi, et al.
Published: (2025)

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation
by: Yi, Zijian, et al.
Published: (2024)

Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition
by: Wang, Yifan, et al.
Published: (2026)

Ada2I: Enhancing Modality Balance for Multimodal Conversational Emotion Recognition
by: Nguyen, Cam-Van Thi, et al.
Published: (2024)

Explainable Multimodal Emotion Recognition
by: Lian, Zheng, et al.
Published: (2023)

Target Speech Diarization with Multimodal Prompts
by: Jiang, Yidi, et al.
Published: (2024)

LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)

Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark
by: Zhang, Han, et al.
Published: (2025)

Private Speech Classification without Collapse: Stabilized DP Training and Offline Distillation
by: Wen, Yadi, et al.
Published: (2026)

Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review
by: Cohn, Clayton, et al.
Published: (2024)