Saved in:
| Main Authors: | Guan, Yiwen, Trinh, Viet Anh, Voleti, Vivek, Whitehill, Jacob |
|---|---|
| Format: | Preprint |
| Published: |
2024
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.09221 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Similar Items
MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
by: Guan, Yiwen, et al.
Published: (2025)
by: Guan, Yiwen, et al.
Published: (2025)
AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
by: Zhou, Dongliang, et al.
Published: (2025)
by: Zhou, Dongliang, et al.
Published: (2025)
MMSD-Net: Towards Multi-modal Stuttering Detection
by: Nie, Liangyu, et al.
Published: (2024)
by: Nie, Liangyu, et al.
Published: (2024)
Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription
by: Hamberger, Anna, et al.
Published: (2025)
by: Hamberger, Anna, et al.
Published: (2025)
Robust Dual-Modal Speech Keyword Spotting for XR Headsets
by: Cai, Zhuojiang, et al.
Published: (2024)
by: Cai, Zhuojiang, et al.
Published: (2024)
CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework
by: Ning, Jinzhong, et al.
Published: (2025)
by: Ning, Jinzhong, et al.
Published: (2025)
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)
by: Shi, Jiatong, et al.
Published: (2025)
When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation
by: Min, Anna, et al.
Published: (2025)
by: Min, Anna, et al.
Published: (2025)
Double Mixture: Towards Continual Event Detection from Speech
by: Kang, Jingqi, et al.
Published: (2024)
by: Kang, Jingqi, et al.
Published: (2024)
Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
by: He, Xinlu, et al.
Published: (2025)
by: He, Xinlu, et al.
Published: (2025)
Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing
by: Trinh, Viet Anh, et al.
Published: (2024)
by: Trinh, Viet Anh, et al.
Published: (2024)
ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
by: Wu, Wenxuan, et al.
Published: (2025)
IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
by: Li, Kai, et al.
Published: (2023)
by: Li, Kai, et al.
Published: (2023)
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)
by: Liu, Zehua, et al.
Published: (2024)
Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis
by: Gupta, Akshita, et al.
Published: (2024)
by: Gupta, Akshita, et al.
Published: (2024)
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
by: Ma, Ziyang, et al.
Published: (2025)
by: Ma, Ziyang, et al.
Published: (2025)
M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection
by: Wang, Anna, et al.
Published: (2024)
by: Wang, Anna, et al.
Published: (2024)
Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
by: Wu, Shu, et al.
Published: (2025)
by: Wu, Shu, et al.
Published: (2025)
Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training
by: He, Jianfeng, et al.
Published: (2023)
by: He, Jianfeng, et al.
Published: (2023)
Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models
by: Wang, Junyu, et al.
Published: (2025)
by: Wang, Junyu, et al.
Published: (2025)
StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion
by: Li, Fengjin, et al.
Published: (2025)
by: Li, Fengjin, et al.
Published: (2025)
Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
by: Li, Yuanchao, et al.
Published: (2024)
by: Li, Yuanchao, et al.
Published: (2024)
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
by: Radhakrishnan, Srijith, et al.
Published: (2023)
by: Radhakrishnan, Srijith, et al.
Published: (2023)
Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper
by: Chen, Gehui, et al.
Published: (2025)
by: Chen, Gehui, et al.
Published: (2025)
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction
by: He, Jiajun, et al.
Published: (2024)
by: He, Jiajun, et al.
Published: (2024)
MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers
by: Mahmud, Tanvir, et al.
Published: (2024)
by: Mahmud, Tanvir, et al.
Published: (2024)
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
by: Goncalves, Lucas, et al.
Published: (2024)
by: Goncalves, Lucas, et al.
Published: (2024)
Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer Architecture
by: Murgul, Sebastian, et al.
Published: (2025)
by: Murgul, Sebastian, et al.
Published: (2025)
Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)
by: Su, Fei, et al.
Published: (2026)
Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
by: Hong, Joanna, et al.
Published: (2025)
by: Hong, Joanna, et al.
Published: (2025)
Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control
by: Li, Bingliang, et al.
Published: (2024)
by: Li, Bingliang, et al.
Published: (2024)
YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
by: Chen, Zihao, et al.
Published: (2024)
by: Chen, Zihao, et al.
Published: (2024)
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
by: Shi, Jiatong, et al.
Published: (2024)
by: Shi, Jiatong, et al.
Published: (2024)
M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
by: Liu, Shansong, et al.
Published: (2023)
by: Liu, Shansong, et al.
Published: (2023)
Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer
by: Li, Jizhen, et al.
Published: (2024)
by: Li, Jizhen, et al.
Published: (2024)
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
by: Li, Shufan, et al.
Published: (2024)
by: Li, Shufan, et al.
Published: (2024)
Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement
by: Lin, Meng-Ping, et al.
Published: (2025)
by: Lin, Meng-Ping, et al.
Published: (2025)
MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)
by: Liu, Shansong, et al.
Published: (2024)
pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data Estimation and Multi-modal Processing
by: Devaney, Johanna, et al.
Published: (2024)
by: Devaney, Johanna, et al.
Published: (2024)
Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition
by: Nishida, Naoto, et al.
Published: (2025)
by: Nishida, Naoto, et al.
Published: (2025)
Similar Items
-
MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
by: Guan, Yiwen, et al.
Published: (2025) -
AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
by: Zhou, Dongliang, et al.
Published: (2025) -
MMSD-Net: Towards Multi-modal Stuttering Detection
by: Nie, Liangyu, et al.
Published: (2024) -
Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription
by: Hamberger, Anna, et al.
Published: (2025) -
Robust Dual-Modal Speech Keyword Spotting for XR Headsets
by: Cai, Zhuojiang, et al.
Published: (2024)