| Main Authors: | Xu, Kele; Wang, Yifan; Feng, Ming; Xu, Qisheng; Chen, Wuyang; Dou, Yutao; Yang, Cheng; Wang, Huaimin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2603.11877 |
Similar Items
Audio-Language Models for Audio-Centric Tasks: A Systematic Survey
by: Su, Yi, et al.
Published: (2025)
Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association
by: Chen, Wuyang, et al.
Published: (2024)
AudioCIL: A Python Toolbox for Audio Class-Incremental Learning with Multiple Scenes
by: Xu, Qisheng, et al.
Published: (2024)
A Comprehensive Review and Taxonomy of Audio-Visual Synchronization Techniques for Realistic Speech Animation
by: Fernandes, Jose Geraldo, et al.
Published: (2024)
Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
by: Liao, Shijia, et al.
Published: (2024)
Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models
by: Tao, Dehua, et al.
Published: (2026)
Ultrasensitive Textile Strain Sensors Redefine Wearable Silent Speech Interfaces with High Machine Learning Efficiency
by: Tang, Chenyu, et al.
Published: (2023)
Unify Variables in Neural Scaling Laws for General Audio Representations via Embedding Effective Rank
by: Deng, Xuyao, et al.
Published: (2025)
Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models
by: Lin, Yuke, et al.
Published: (2025)
A Survey on Speech Large Language Models for Understanding
by: Peng, Jing, et al.
Published: (2024)
Pretraining Large Brain Language Model for Active BCI: Silent Speech
by: Zhou, Jinzhao, et al.
Published: (2025)
Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training
by: Yang, Yifan, et al.
Published: (2026)
Poster: Recognizing Hidden-in-the-Ear Private Key for Reliable Silent Speech Interface Using Multi-Task Learning
by: Dong, Xuefu, et al.
Published: (2025)
DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
by: Li, Li, et al.
Published: (2026)
IR-UWB Radar-Based Contactless Silent Speech Recognition with Attention-Enhanced Temporal Convolutional Networks
by: Lee, Sunghwa, et al.
Published: (2025)
LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading
by: Yemini, Yochai, et al.
Published: (2023)
Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)
FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech
by: Ma, Linhan, et al.
Published: (2025)
Exploring Speech Foundation Models for Speaker Diarization Across Lifespan
by: Xu, Anfeng, et al.
Published: (2026)
Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems
by: Xiao, Yang, et al.
Published: (2026)
AS-Speech: Adaptive Style For Speech Synthesis
by: Li, Zhipeng, et al.
Published: (2024)
Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description
by: Liu, Wuyang, et al.
Published: (2023)
Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR
by: Li, Ziwei, et al.
Published: (2026)
Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
by: Liu, Wenrui, et al.
Published: (2025)
Affect Decoding in Phonated and Silent Speech Production from Surface EMG
by: Pistrosch, Simon, et al.
Published: (2026)
Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model
by: Wang, Siyang, et al.
Published: (2024)
Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI
by: Lin, Yi-Cheng, et al.
Published: (2026)
Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System
by: Li, Ze, et al.
Published: (2024)
WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
by: Ma, Linhan, et al.
Published: (2024)
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model
by: Ren, Yong, et al.
Published: (2025)
Adaptive Speaker Embedding Self-Augmentation for Personal Voice Activity Detection with Short Enrollment Speech
by: Feng, Fuyuan, et al.
Published: (2026)
Adapting Speech Foundation Models for Unified Multimodal Speech Recognition with Large Language Models
by: Zhang, Jing-Xuan, et al.
Published: (2025)
ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction
by: Yang, Shu-wen, et al.
Published: (2025)
A Parallel Ultra-Low Power Silent Speech Interface based on a Wearable, Fully-dry EMG Neckband
by: Meier, Fiona, et al.
Published: (2025)
Position: Towards Responsible Evaluation for Text-to-Speech
by: Yang, Yifan, et al.
Published: (2025)
DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation
by: Li, Baihan, et al.
Published: (2024)
Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey
by: Xie, Tianxin, et al.
Published: (2024)
ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5
by: Zhou, Jiaming, et al.
Published: (2024)
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
by: Wang, Yuanyuan, et al.
Published: (2025)
Bayesian Speech Synthesizers Can Learn from Multiple Teachers
by: Zhang, Ziyang, et al.
Published: (2025)