| Main Authors: | Wang, Rong, Sun, Kun |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2404.12077 |
Similar Items
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning
by: Zhang, Shaolei, et al.
Published: (2024)
VisionScores -- A system-segmented image score dataset for deep learning tasks
by: Amezcua, Alejandro Romero, et al.
Published: (2025)
PCOV-KWS: Multi-task Learning for Personalized Customizable Open Vocabulary Keyword Spotting
by: Pan, Jianan, et al.
Published: (2026)
Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech
by: Weise, Tobias, et al.
Published: (2024)
SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models
by: Kumar, Anurag, et al.
Published: (2025)
Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition
by: Yang, Chao-Han Huck, et al.
Published: (2024)
Evaluating the Effectiveness of Transformer Layers in Wav2Vec 2.0, XLS-R, and Whisper for Speaker Identification Tasks
by: Stuhlmann, Linus, et al.
Published: (2025)
USAT: A Universal Speaker-Adaptive Text-to-Speech Approach
by: Wang, Wenbin, et al.
Published: (2024)
Low-resource speech recognition and dialect identification of Irish in a multi-task framework
by: Lonergan, Liam, et al.
Published: (2024)
SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech
by: R, Aron, et al.
Published: (2024)
Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach
by: Liu, Weide, et al.
Published: (2023)
Unsupervised Speech Segmentation: A General Approach Using Speech Language Models
by: Elmakies, Avishai, et al.
Published: (2025)
Innovative Speech-Based Deep Learning Approaches for Parkinson's Disease Classification: A Systematic Review
by: van Gelderen, Lisanne, et al.
Published: (2024)
Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model
by: Xie, Jiamin, et al.
Published: (2023)
Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey
by: Xie, Tianxin, et al.
Published: (2024)
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
by: Zhang, Leying, et al.
Published: (2024)
ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis
by: Toyin, Hawau Olamide, et al.
Published: (2025)
Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
by: Prabhu, Darshan, et al.
Published: (2024)
Cross-Speaker Encoding Network for Multi-Talker Speech Recognition
by: Kang, Jiawen, et al.
Published: (2024)
Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems
by: Park, Taejin, et al.
Published: (2024)
MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation
by: Le-Duc, Khai, et al.
Published: (2025)
Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation
by: Wang, Peidong, et al.
Published: (2025)
Do Music Generation Models Encode Music Theory?
by: Wei, Megan, et al.
Published: (2024)
Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis
by: Fujita, Kenichi, et al.
Published: (2024)
TokenChain: A Discrete Speech Chain via Semantic Token Modeling
by: Wang, Mingxuan, et al.
Published: (2025)
Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion
by: Kulkarni, Ajinkya, et al.
Published: (2025)
Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
by: He, Xinlu, et al.
Published: (2025)
Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition
by: Su, Hsuan, et al.
Published: (2024)
Multi-class Decoding of Attended Speaker Direction Using Electroencephalogram and Audio Spatial Spectrum
by: Zhang, Yuanming, et al.
Published: (2024)
Modality-Invariant Bidirectional Temporal Representation Distillation Network for Missing Multimodal Sentiment Analysis
by: Wang, Xincheng, et al.
Published: (2025)
SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data
by: Wang, Hsuan-Fu, et al.
Published: (2024)
FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
by: Liu, Yutong, et al.
Published: (2025)
FlashSpeech: Efficient Zero-Shot Speech Synthesis
by: Ye, Zhen, et al.
Published: (2024)
Foundation Models for Music: A Survey
by: Ma, Yinghao, et al.
Published: (2024)
COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings
by: Zhu, Yonggang, et al.
Published: (2026)
DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
by: Jia, Dongya, et al.
Published: (2025)
Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
by: Liu, Zhijun, et al.
Published: (2024)
C3LLM: Conditional Multimodal Content Generation Using Large Language Models
by: Wang, Zixuan, et al.
Published: (2024)
ComposerX: Multi-Agent Symbolic Music Composition with LLMs
by: Deng, Qixin, et al.
Published: (2024)
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
by: Nam, KiHyun, et al.
Published: (2026)