:: Library Catalog

Bibliographic Details
Main Authors:	Shen, Maohao; Jayashankar, Tejas; Hanna, Osama; Kanda, Naoyuki; Wang, Yancheng; Žmolíková, Kateřina; Xie, Ruiming; Moritz, Niko; Xu, Anfeng; Gaur, Yashesh; Wornell, Gregory; He, Qing; Wu, Jilong
Format:	Preprint
Published:	2026
Subjects:	Sound; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.13891

Similar Items

Conversational Speech Naturalness Predictor
by: Xu, Anfeng, et al.
Published: (2026)

Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens
by: Zhao, Jinzheng, et al.
Published: (2024)

Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
by: Seide, Frank, et al.
Published: (2024)

AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition
by: Lin, Ju, et al.
Published: (2024)

Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation
by: Shen, Maohao, et al.
Published: (2024)

Score-of-Mixture Training: Training One-Step Generative Models Made Simple via Score Estimation of Mixture Distributions
by: Jayashankar, Tejas, et al.
Published: (2025)

Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
by: Kang, Wonjune, et al.
Published: (2024)

DiariST: Streaming Speech Translation with Speaker Diarization
by: Yang, Mu, et al.
Published: (2023)

Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
by: Moritz, Niko, et al.
Published: (2024)

Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios
by: Subramanian, Aswin Shanmugam, et al.
Published: (2025)

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
by: Wang, Xiaofei, et al.
Published: (2023)

VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation
by: Wang, Yancheng, et al.
Published: (2026)

Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits
by: Feng, Tiantian, et al.
Published: (2025)

Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation
by: Wang, Peidong, et al.
Published: (2025)

Examining Test-Time Adaptation for Personalized Child Speech Recognition
by: Shi, Zhonghao, et al.
Published: (2024)

LongSpeech: A Scalable Benchmark for Transcription, Translation and Understanding in Long Speech
by: Yang, Fei, et al.
Published: (2026)

Coding Speech through Vocal Tract Kinematics
by: Cho, Cheol Jun, et al.
Published: (2024)

M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses
by: Yang, Yufeng, et al.
Published: (2024)

Directional Source Separation for Robust Speech Recognition on Smart Glasses
by: Feng, Tiantian, et al.
Published: (2023)

ASTRA: Aligning Speech and Text Representations for ASR without Sampling
by: Gaur, Neeraj, et al.
Published: (2024)

Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe
by: Feng, Tiantian, et al.
Published: (2025)

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood
by: Feng, Tiantian, et al.
Published: (2026)

Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models
by: Sun, Haoqin, et al.
Published: (2026)

DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec
by: Li, Tao, et al.
Published: (2025)

Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study
by: Chen, Peikun, et al.
Published: (2024)

Who Said What (WSW 2.0)? Enhanced Automated Analysis of Preschool Classroom Speech
by: Sun, Anchen, et al.
Published: (2025)

Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions
by: Baskar, Murali Karthick, et al.
Published: (2024)

Profile-Error-Tolerant Target-Speaker Voice Activity Detection
by: Wang, Dongmei, et al.
Published: (2023)

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
by: Kanda, Naoyuki, et al.
Published: (2024)

DDSP-QbE++: Improving Speech Quality for Speech Anonymisation for Atypical Speech
by: Ghosh, Suhita, et al.
Published: (2026)

Multi-Channel Speech Enhancement for Cocktail Party Speech Emotion Recognition
by: Chen, Youjun, et al.
Published: (2026)

X-Talk: On the Underestimated Potential of Modular Speech-to-Speech Dialogue System
by: Liu, Zhanxun, et al.
Published: (2025)

SpeechT: Findings of the First Mentorship in Speech Translation
by: Moslem, Yasmin, et al.
Published: (2025)

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
by: Shi, Jiacheng, et al.
Published: (2026)

An Efficient Transfer Learning Method Based on Adapter with Local Attributes for Speech Emotion Recognition
by: Song, Haoyu, et al.
Published: (2025)

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
by: Yang, Qian, et al.
Published: (2024)

Joint ASR and Speaker Role Tagging with Serialized Output Training
by: Xu, Anfeng, et al.
Published: (2025)

WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing
by: Dai, Yuhang, et al.
Published: (2025)

Forensic Similarity for Speech Deepfakes
by: Negroni, Viola, et al.
Published: (2025)

EMG-to-Speech with Fewer Channels
by: Hwang, Injune, et al.
Published: (2026)