:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Lu, Yichen, Dai, Wei, Liu, Jiaen, Kwok, Ching Wing, Wu, Zongheng, Xiao, Xudong, Sun, Ao, Fu, Sheng, Zhan, Jianyuan, Wang, Yian, Saito, Takatomo, Lai, Sicheng
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence Computation and Language Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2507.07306
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over
by: Ogawa, Atsunori, et al.
Published: (2024)

Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR
by: Mei, Yuxiang, et al.
Published: (2026)

Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training
by: Feng, Jianyuan, et al.
Published: (2025)

Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation
by: Matsuura, Kohei, et al.
Published: (2024)

Long-Context Speech Synthesis with Context-Aware Memory
by: Li, Zhipeng, et al.
Published: (2025)

Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis
by: Leung, Wing-Zin, et al.
Published: (2024)

Voice Conversion Augmentation for Speaker Recognition on Defective Datasets
by: Tao, Ruijie, et al.
Published: (2024)

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning
by: Tian, Wenjie, et al.
Published: (2026)

EMMeTT: Efficient Multimodal Machine Translation Training
by: Żelasko, Piotr, et al.
Published: (2024)

AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations
by: Wu, Sheng, et al.
Published: (2024)

Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation
by: Lei, Ligong, et al.
Published: (2026)

Autonomous Soundscape Augmentation with Multimodal Fusion of Visual and Participant-linked Inputs
by: Ooi, Kenneth, et al.
Published: (2023)

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer
by: Liu, Huadai, et al.
Published: (2023)

RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition
by: Wang, Pengcheng, et al.
Published: (2025)

XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection
by: Ng, Kwok-Ho, et al.
Published: (2026)

ViSAGe: Video-to-Spatial Audio Generation
by: Kim, Jaeyeon, et al.
Published: (2025)

Lend a Hand: Semi Training-Free Cued Speech Recognition via MLLM-Driven Hand Modeling for Barrier-free Communication
by: Huang, Guanjie, et al.
Published: (2025)

Two-sided Acoustic Metascreen for Broadband and Individual Reflection and Transmission Control
by: Chen, Ao, et al.
Published: (2024)

Aligning Text-to-Music Evaluation with Human Preferences
by: Huang, Yichen, et al.
Published: (2025)

Neural Network-Based Time-Frequency-Bin-Wise Linear Combination of Beamformers for Underdetermined Target Source Extraction
by: Chen, Changda, et al.
Published: (2026)

Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining
by: Xue, Jinlong, et al.
Published: (2024)

Towards Generating Diverse Audio Captions via Adversarial Training
by: Mei, Xinhao, et al.
Published: (2022)

Speech Separation using Neural Audio Codecs with Embedding Loss
by: Yip, Jia Qi, et al.
Published: (2024)

Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context
by: Ao, Junyi, et al.
Published: (2025)

NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge
by: Kamo, Naoyuki, et al.
Published: (2024)

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
by: Xing, Zhenghao, et al.
Published: (2025)

A Kalman Filter model for synchronization in musical ensembles
by: Carvalho, Hugo T., et al.
Published: (2024)

Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
by: Yang, Jianing, et al.
Published: (2025)

Speech-to-Text Translation with Phoneme-Augmented CoT: Enhancing Cross-Lingual Transfer in Low-Resource Scenarios
by: Gállego, Gerard I., et al.
Published: (2025)

SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data
by: Lu, Yichen, et al.
Published: (2024)

Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction
by: Zhao, Yuan, et al.
Published: (2024)

A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining
by: Xiao, Feiyang, et al.
Published: (2024)

C3LLM: Conditional Multimodal Content Generation Using Large Language Models
by: Wang, Zixuan, et al.
Published: (2024)

Low algorithmic delay implementation of convolutional beamformer for online joint source separation and dereverberation
by: Mo, Kaien, et al.
Published: (2024)

Advancing Continual Learning for Robust Deepfake Audio Classification
by: Dong, Feiyi, et al.
Published: (2024)

Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning
by: Changin, Choi, et al.
Published: (2024)

Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach
by: Liu, Weide, et al.
Published: (2023)

Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
by: Moritz, Niko, et al.
Published: (2024)

Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment
by: Wang, Ke, et al.
Published: (2025)

Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
by: Wang, Baisen, et al.
Published: (2024)