| Main Authors: | Shen, Maohao, Jayashankar, Tejas, Hanna, Osama, Kanda, Naoyuki, Wang, Yancheng, Žmolíková, Kateřina, Xie, Ruiming, Moritz, Niko, Xu, Anfeng, Gaur, Yashesh, Wornell, Gregory, He, Qing, Wu, Jilong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.13891 |
Similar Items
Conversational Speech Naturalness Predictor
by: Xu, Anfeng, et al.
Published: (2026)
Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens
by: Zhao, Jinzheng, et al.
Published: (2024)
Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
by: Seide, Frank, et al.
Published: (2024)
AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition
by: Lin, Ju, et al.
Published: (2024)
Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation
by: Shen, Maohao, et al.
Published: (2024)
Score-of-Mixture Training: Training One-Step Generative Models Made Simple via Score Estimation of Mixture Distributions
by: Jayashankar, Tejas, et al.
Published: (2025)
Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
by: Kang, Wonjune, et al.
Published: (2024)
DiariST: Streaming Speech Translation with Speaker Diarization
by: Yang, Mu, et al.
Published: (2023)
Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
by: Moritz, Niko, et al.
Published: (2024)
Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios
by: Subramanian, Aswin Shanmugam, et al.
Published: (2025)
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
by: Wang, Xiaofei, et al.
Published: (2023)
VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation
by: Wang, Yancheng, et al.
Published: (2026)
Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits
by: Feng, Tiantian, et al.
Published: (2025)
Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation
by: Wang, Peidong, et al.
Published: (2025)
Examining Test-Time Adaptation for Personalized Child Speech Recognition
by: Shi, Zhonghao, et al.
Published: (2024)
LongSpeech: A Scalable Benchmark for Transcription, Translation and Understanding in Long Speech
by: Yang, Fei, et al.
Published: (2026)
Coding Speech through Vocal Tract Kinematics
by: Cho, Cheol Jun, et al.
Published: (2024)
M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses
by: Yang, Yufeng, et al.
Published: (2024)
Directional Source Separation for Robust Speech Recognition on Smart Glasses
by: Feng, Tiantian, et al.
Published: (2023)
ASTRA: Aligning Speech and Text Representations for ASR without Sampling
by: Gaur, Neeraj, et al.
Published: (2024)
Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe
by: Feng, Tiantian, et al.
Published: (2025)
ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood
by: Feng, Tiantian, et al.
Published: (2026)
Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models
by: Sun, Haoqin, et al.
Published: (2026)
DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec
by: Li, Tao, et al.
Published: (2025)
Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study
by: Chen, Peikun, et al.
Published: (2024)
Who Said What (WSW 2.0)? Enhanced Automated Analysis of Preschool Classroom Speech
by: Sun, Anchen, et al.
Published: (2025)
Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions
by: Baskar, Murali Karthick, et al.
Published: (2024)
Profile-Error-Tolerant Target-Speaker Voice Activity Detection
by: Wang, Dongmei, et al.
Published: (2023)
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
by: Kanda, Naoyuki, et al.
Published: (2024)
DDSP-QbE++: Improving Speech Quality for Speech Anonymisation for Atypical Speech
by: Ghosh, Suhita, et al.
Published: (2026)
Multi-Channel Speech Enhancement for Cocktail Party Speech Emotion Recognition
by: Chen, Youjun, et al.
Published: (2026)
X-Talk: On the Underestimated Potential of Modular Speech-to-Speech Dialogue System
by: Liu, Zhanxun, et al.
Published: (2025)
SpeechT: Findings of the First Mentorship in Speech Translation
by: Moslem, Yasmin, et al.
Published: (2025)
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
by: Shi, Jiacheng, et al.
Published: (2026)
An Efficient Transfer Learning Method Based on Adapter with Local Attributes for Speech Emotion Recognition
by: Song, Haoyu, et al.
Published: (2025)
MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
by: Yang, Qian, et al.
Published: (2024)
Joint ASR and Speaker Role Tagging with Serialized Output Training
by: Xu, Anfeng, et al.
Published: (2025)
WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing
by: Dai, Yuhang, et al.
Published: (2025)
Forensic Similarity for Speech Deepfakes
by: Negroni, Viola, et al.
Published: (2025)
EMG-to-Speech with Fewer Channels
by: Hwang, Injune, et al.
Published: (2026)