:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Siyin; Yang, Chao-Han Huck; Wu, Ji; Zhang, Chao
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2404.14716

Similar Items

Can Whisper perform speech-based in-context learning?
by: Wang, Siyin, et al.
Published: (2023)

Audio Large Language Models Can Be Descriptive Speech Quality Evaluators
by: Chen, Chen, et al.
Published: (2025)

Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
by: Hu, Yuchen, et al.
Published: (2024)

Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction
by: Ko, Yuka, et al.
Published: (2024)

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)

Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
by: Goncalves, Lucas, et al.
Published: (2024)

Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction
by: Sachdev, Rithik, et al.
Published: (2024)

Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?
by: Guan, Yiwen, et al.
Published: (2024)

DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
by: Lu, Ke-Han, et al.
Published: (2024)

UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
by: Liu, Alexander H., et al.
Published: (2025)

ESPnet-SpeechLM: An Open Speech Language Model Toolkit
by: Tian, Jinchuan, et al.
Published: (2025)

Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
by: Chen, Chen, et al.
Published: (2024)

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation
by: Wang, Siyin, et al.
Published: (2024)

Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping
by: Kang, Minki, et al.
Published: (2023)

Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits
by: Huang, Sung-Feng, et al.
Published: (2025)

Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
by: Wu, Yihan, et al.
Published: (2024)

Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
by: Radhakrishnan, Srijith, et al.
Published: (2023)

IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
by: Li, Kai, et al.
Published: (2023)

HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling
by: Si, Yuke, et al.
Published: (2025)

CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)

Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
by: Jiang, Xilin, et al.
Published: (2025)

QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions
by: Wang, Siyin, et al.
Published: (2025)

AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)

Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation
by: Tan, Weiting, et al.
Published: (2025)

Self-Powered LLM Modality Expansion for Large Speech-Text Models
by: Yu, Tengfei, et al.
Published: (2024)

Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models
by: Cui, Ziyun, et al.
Published: (2024)

Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling
by: Liu, Rui, et al.
Published: (2024)

SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
by: Yu, Wenyi, et al.
Published: (2024)

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2026)

Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder
by: Li, Zhengyang, et al.
Published: (2026)

JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition
by: Sun, Chang, et al.
Published: (2024)

Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
by: Liu, Zehua, et al.
Published: (2025)

TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality
by: Feng, Tiantian, et al.
Published: (2024)

Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis
by: Do, Cong-Thanh, et al.
Published: (2024)

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)

Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving
by: Xie, Jingran, et al.
Published: (2025)

Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection
by: Lin, Hsi-Che, et al.
Published: (2024)

A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation
by: Min, Anna, et al.
Published: (2025)

PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models
by: Yang, Runyan, et al.
Published: (2024)

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
by: Han, HyoJung, et al.
Published: (2024)