| Main Authors: | Lu, Yichen, Dai, Wei, Liu, Jiaen, Kwok, Ching Wing, Wu, Zongheng, Xiao, Xudong, Sun, Ao, Fu, Sheng, Zhan, Jianyuan, Wang, Yian, Saito, Takatomo, Lai, Sicheng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.07306 |
Similar Items
Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over
by: Ogawa, Atsunori, et al.
Published: (2024)
Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR
by: Mei, Yuxiang, et al.
Published: (2026)
Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training
by: Feng, Jianyuan, et al.
Published: (2025)
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation
by: Matsuura, Kohei, et al.
Published: (2024)
Long-Context Speech Synthesis with Context-Aware Memory
by: Li, Zhipeng, et al.
Published: (2025)
Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis
by: Leung, Wing-Zin, et al.
Published: (2024)
Voice Conversion Augmentation for Speaker Recognition on Defective Datasets
by: Tao, Ruijie, et al.
Published: (2024)
Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning
by: Tian, Wenjie, et al.
Published: (2026)
EMMeTT: Efficient Multimodal Machine Translation Training
by: Żelasko, Piotr, et al.
Published: (2024)
AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations
by: Wu, Sheng, et al.
Published: (2024)
Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation
by: Lei, Ligong, et al.
Published: (2026)
Autonomous Soundscape Augmentation with Multimodal Fusion of Visual and Participant-linked Inputs
by: Ooi, Kenneth, et al.
Published: (2023)
ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer
by: Liu, Huadai, et al.
Published: (2023)
RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition
by: Wang, Pengcheng, et al.
Published: (2025)
XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection
by: Ng, Kwok-Ho, et al.
Published: (2026)
ViSAGe: Video-to-Spatial Audio Generation
by: Kim, Jaeyeon, et al.
Published: (2025)
Lend a Hand: Semi Training-Free Cued Speech Recognition via MLLM-Driven Hand Modeling for Barrier-free Communication
by: Huang, Guanjie, et al.
Published: (2025)
Two-sided Acoustic Metascreen for Broadband and Individual Reflection and Transmission Control
by: Chen, Ao, et al.
Published: (2024)
Aligning Text-to-Music Evaluation with Human Preferences
by: Huang, Yichen, et al.
Published: (2025)
Neural Network-Based Time-Frequency-Bin-Wise Linear Combination of Beamformers for Underdetermined Target Source Extraction
by: Chen, Changda, et al.
Published: (2026)
Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining
by: Xue, Jinlong, et al.
Published: (2024)
Towards Generating Diverse Audio Captions via Adversarial Training
by: Mei, Xinhao, et al.
Published: (2022)
Speech Separation using Neural Audio Codecs with Embedding Loss
by: Yip, Jia Qi, et al.
Published: (2024)
Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context
by: Ao, Junyi, et al.
Published: (2025)
NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge
by: Kamo, Naoyuki, et al.
Published: (2024)
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
by: Xing, Zhenghao, et al.
Published: (2025)
A Kalman Filter model for synchronization in musical ensembles
by: Carvalho, Hugo T., et al.
Published: (2024)
Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
by: Yang, Jianing, et al.
Published: (2025)
Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios
by: Gállego, Gerard I., et al.
Published: (2025)
SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data
by: Lu, Yichen, et al.
Published: (2024)
Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction
by: Zhao, Yuan, et al.
Published: (2024)
A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining
by: Xiao, Feiyang, et al.
Published: (2024)
C3LLM: Conditional Multimodal Content Generation Using Large Language Models
by: Wang, Zixuan, et al.
Published: (2024)
Low algorithmic delay implementation of convolutional beamformer for online joint source separation and dereverberation
by: Mo, Kaien, et al.
Published: (2024)
Advancing Continual Learning for Robust Deepfake Audio Classification
by: Dong, Feiyi, et al.
Published: (2024)
Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning
by: Changin, Choi, et al.
Published: (2024)
Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach
by: Liu, Weide, et al.
Published: (2023)
Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
by: Moritz, Niko, et al.
Published: (2024)
Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment
by: Wang, Ke, et al.
Published: (2025)
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
by: Wang, Baisen, et al.
Published: (2024)