| Main Authors: | Wang, Siyin, Yang, Chao-Han Huck, Wu, Ji, Zhang, Chao |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.14716 |
Similar Items
Can Whisper perform speech-based in-context learning?
by: Wang, Siyin, et al.
Published: (2023)
Audio Large Language Models Can Be Descriptive Speech Quality Evaluators
by: Chen, Chen, et al.
Published: (2025)
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
by: Hu, Yuchen, et al.
Published: (2024)
Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction
by: Ko, Yuka, et al.
Published: (2024)
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
by: Xue, Jinlong, et al.
Published: (2024)
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
by: Goncalves, Lucas, et al.
Published: (2024)
Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction
by: Sachdev, Rithik, et al.
Published: (2024)
Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?
by: Guan, Yiwen, et al.
Published: (2024)
DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
by: Lu, Ke-Han, et al.
Published: (2024)
UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
by: Liu, Alexander H., et al.
Published: (2025)
ESPnet-SpeechLM: An Open Speech Language Model Toolkit
by: Tian, Jinchuan, et al.
Published: (2025)
Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
by: Chen, Chen, et al.
Published: (2024)
Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation
by: Wang, Siyin, et al.
Published: (2024)
Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping
by: Kang, Minki, et al.
Published: (2023)
Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits
by: Huang, Sung-Feng, et al.
Published: (2025)
Robust Audiovisual Speech Recognition Models with Mixture-of-Experts
by: Wu, Yihan, et al.
Published: (2024)
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
by: Radhakrishnan, Srijith, et al.
Published: (2023)
IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
by: Li, Kai, et al.
Published: (2023)
HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling
by: Si, Yuke, et al.
Published: (2025)
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
by: Chen, Xueyuan, et al.
Published: (2024)
Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
by: Jiang, Xilin, et al.
Published: (2025)
QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions
by: Wang, Siyin, et al.
Published: (2025)
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)
Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation
by: Tan, Weiting, et al.
Published: (2025)
Self-Powered LLM Modality Expansion for Large Speech-Text Models
by: Yu, Tengfei, et al.
Published: (2024)
Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models
by: Cui, Ziyun, et al.
Published: (2024)
Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling
by: Liu, Rui, et al.
Published: (2024)
SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
by: Yu, Wenyi, et al.
Published: (2024)
Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition
by: Cappellazzo, Umberto, et al.
Published: (2026)
Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder
by: Li, Zhengyang, et al.
Published: (2026)
JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition
by: Sun, Chang, et al.
Published: (2024)
Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing
by: Liu, Zehua, et al.
Published: (2025)
TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality
by: Feng, Tiantian, et al.
Published: (2024)
Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis
by: Do, Cong-Thanh, et al.
Published: (2024)
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)
Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving
by: Xie, Jingran, et al.
Published: (2025)
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection
by: Lin, Hsi-Che, et al.
Published: (2024)
A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation
by: Min, Anna, et al.
Published: (2025)
PolySpeech: Exploring Unified Multitask Speech Models for Competitiveness with Single-task Models
by: Yang, Runyan, et al.
Published: (2024)
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
by: Han, HyoJung, et al.
Published: (2024)