| Main Authors: | Fernandez-Lopez, Adriana, Chen, Honglie, Ma, Pingchuan, Yin, Lu, Xiao, Qiao, Petridis, Stavros, Liu, Shiwei, Pantic, Maja |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.17614 |
Similar Items
Large Language Models are Strong Audio-Visual Speech Recognition Learners
by: Cappellazzo, Umberto, et al.
Published: (2024)
RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement
by: Chen, Honglie, et al.
Published: (2024)
Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models
by: Fernandez-Lopez, Adriana, et al.
Published: (2024)
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
by: Kim, Minsu, et al.
Published: (2025)
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2025)
Dynamic Data Pruning for Automatic Speech Recognition
by: Xiao, Qiao, et al.
Published: (2024)
Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
by: Cappellazzo, Umberto, et al.
Published: (2025)
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
by: Cappellazzo, Umberto, et al.
Published: (2025)
Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction
by: Kim, Minsu, et al.
Published: (2025)
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
by: Haliassos, Alexandros, et al.
Published: (2024)
Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2026)
Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs
by: Anand, et al.
Published: (2025)
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
by: Yeo, Jeong Hun, et al.
Published: (2025)
Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
by: Hong, Joanna, et al.
Published: (2025)
Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach
by: Cappellazzo, Umberto, et al.
Published: (2025)
BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition
by: Haliassos, Alexandros, et al.
Published: (2024)
Angle-Optimized Partial Disentanglement for Multimodal Emotion Recognition in Conversation
by: Che, Xinyi, et al.
Published: (2025)
MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition
by: Pan, Yu, et al.
Published: (2023)
Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition
by: Wu, Linzhi, et al.
Published: (2026)
FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing
by: Cai, Lingling, et al.
Published: (2024)
Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation
by: Wang, Yongqi, et al.
Published: (2025)
State-Anchored Complete-View Distillation for Robust Conversational Multimodal Emotion Recognition
by: Pan, Zhaoyan, et al.
Published: (2026)
SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation
by: Huynh, Ngoc Dung, et al.
Published: (2025)
Multimodal Emotion Recognition with Large Language Models
by: Zhang, Hongrui, et al.
Published: (2026)
SpikEmo: Enhancing Emotion Recognition With Spiking Temporal Dynamics in Conversations
by: Yu, Xiaomin, et al.
Published: (2024)
MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
by: Guan, Yiwen, et al.
Published: (2025)
TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition
by: Wang, Xinyu, et al.
Published: (2026)
Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)
Multimodal Learned Sparse Retrieval for Image Suggestion
by: Nguyen, Thong, et al.
Published: (2024)
Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
by: Zhou, Qianrui, et al.
Published: (2026)
Orthogonal Disentanglement with Projected Feature Alignment for Multimodal Emotion Recognition in Conversation
by: Che, Xinyi, et al.
Published: (2025)
Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation
by: Yi, Zijian, et al.
Published: (2024)
Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition
by: Wang, Yifan, et al.
Published: (2026)
Ada2I: Enhancing Modality Balance for Multimodal Conversational Emotion Recognition
by: Nguyen, Cam-Van Thi, et al.
Published: (2024)
Explainable Multimodal Emotion Recognition
by: Lian, Zheng, et al.
Published: (2023)
Target Speech Diarization with Multimodal Prompts
by: Jiang, Yidi, et al.
Published: (2024)
LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)
Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark
by: Zhang, Han, et al.
Published: (2025)
Private Speech Classification without Collapse: Stabilized DP Training and Offline Distillation
by: Wen, Yadi, et al.
Published: (2026)
Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review
by: Cohn, Clayton, et al.
Published: (2024)