:: Library Catalog

Bibliographic Details
Main Authors:	Guan, Yiwen; Trinh, Viet Anh; Voleti, Vivek; Whitehill, Jacob
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Multimedia; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2409.09221

Similar Items

MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
by: Guan, Yiwen, et al.
Published: (2025)

AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
by: Zhou, Dongliang, et al.
Published: (2025)

MMSD-Net: Towards Multi-modal Stuttering Detection
by: Nie, Liangyu, et al.
Published: (2024)

Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription
by: Hamberger, Anna, et al.
Published: (2025)

Robust Dual-Modal Speech Keyword Spotting for XR Headsets
by: Cai, Zhuojiang, et al.
Published: (2024)

CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework
by: Ning, Jinzhong, et al.
Published: (2025)

ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)

When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation
by: Min, Anna, et al.
Published: (2025)

Double Mixture: Towards Continual Event Detection from Speech
by: Kang, Jingqi, et al.
Published: (2024)

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
by: He, Xinlu, et al.
Published: (2025)

Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing
by: Trinh, Viet Anh, et al.
Published: (2024)

ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction
by: Wu, Wenxuan, et al.
Published: (2025)

IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
by: Li, Kai, et al.
Published: (2023)

AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)

Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis
by: Gupta, Akshita, et al.
Published: (2024)

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
by: Ma, Ziyang, et al.
Published: (2025)

M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection
by: Wang, Anna, et al.
Published: (2024)

Audio-Thinker: Guiding Audio Language Model When and How to Think via Reinforcement Learning
by: Wu, Shu, et al.
Published: (2025)

Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training
by: He, Jianfeng, et al.
Published: (2023)

Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models
by: Wang, Junyu, et al.
Published: (2025)

StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion
by: Li, Fengjin, et al.
Published: (2025)

Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
by: Li, Yuanchao, et al.
Published: (2024)

Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
by: Radhakrishnan, Srijith, et al.
Published: (2023)

Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper
by: Chen, Gehui, et al.
Published: (2025)

MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction
by: He, Jiajun, et al.
Published: (2024)

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers
by: Mahmud, Tanvir, et al.
Published: (2024)

Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
by: Goncalves, Lucas, et al.
Published: (2024)

Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer Architecture
by: Murgul, Sebastian, et al.
Published: (2025)

Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement
by: Su, Fei, et al.
Published: (2026)

Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment
by: Hong, Joanna, et al.
Published: (2025)

Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control
by: Li, Bingliang, et al.
Published: (2024)

YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls
by: Chen, Zihao, et al.
Published: (2024)

VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
by: Shi, Jiatong, et al.
Published: (2024)

M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
by: Liu, Shansong, et al.
Published: (2023)

Improving Speech Enhancement by Integrating Inter-Channel and Band Features with Dual-branch Conformer
by: Li, Jizhen, et al.
Published: (2024)

OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
by: Li, Shufan, et al.
Published: (2024)

Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement
by: Lin, Meng-Ping, et al.
Published: (2025)

MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
by: Liu, Shansong, et al.
Published: (2024)

pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data Estimation and Multi-modal Processing
by: Devaney, Johanna, et al.
Published: (2024)

Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition
by: Nishida, Naoto, et al.
Published: (2025)