:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Wei; Li, Jiahong; Shao, Yiwen; Yu, Dong
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2511.14410

Similar Items

AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning
by: Shao, Yiwen, et al.
Published: (2025)

Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
by: Moritz, Niko, et al.
Published: (2024)

WhisperVC: Decoupled Cross-Domain Alignment and Speech Generation for Low-Resource Whisper-to-Normal Conversion
by: Liu, Dong, et al.
Published: (2025)

Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding
by: Huo, Mingyue, et al.
Published: (2025)

TOGGL: Transcribing Overlapping Speech with Staggered Labeling
by: Li, Chak-Fai, et al.
Published: (2024)

Adapting Self-Supervised Speech Representations for Cross-lingual Dysarthria Detection in Parkinson's Disease
by: Hernandez, Abner, et al.
Published: (2026)

RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios
by: Shao, Yiwen, et al.
Published: (2023)

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data
by: Saeki, Takaaki, et al.
Published: (2024)

SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription
by: Dai, Yuhang, et al.
Published: (2026)

CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
by: Kim, Ji-Hoon, et al.
Published: (2024)

Efficient Multilingual ASR Finetuning via LoRA Language Experts
by: Li, Jiahong, et al.
Published: (2025)

Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition
by: Yang, Zhengdong, et al.
Published: (2025)

Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition
by: Mitsumori, Shunsuke, et al.
Published: (2025)

Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation
by: Tseng, Wei-Cheng, et al.
Published: (2025)

Efficient Scaling for LLM-based ASR
by: Mu, Bingshen, et al.
Published: (2025)

TASU: Text-Only Alignment for Speech Understanding
by: Peng, Jing, et al.
Published: (2025)

Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis
by: Niu, Zhikang, et al.
Published: (2025)

SiamCTC: Learning Speech Representations through Monotonic Temporal Alignment
by: Eom, SooHwan, et al.
Published: (2026)

Cross-lingual Alzheimer's Disease detection based on paralinguistic and pre-trained features
by: Chen, Xuchu, et al.
Published: (2023)

GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
by: Chen, Guoguo, et al.
Published: (2021)

Transcribe, Align and Segment: Creating speech datasets for low-resource languages
by: Sereda, Taras
Published: (2024)

Zero-shot Cross-lingual Voice Transfer for TTS
by: Biadsy, Fadi, et al.
Published: (2024)

TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding
by: Huo, Mingyue, et al.
Published: (2026)

TTA-Bench: A Comprehensive Benchmark for Evaluating Text-to-Audio Models
by: Wang, Hui, et al.
Published: (2025)

LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition
by: Yoon, Eunseop, et al.
Published: (2024)

Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation
by: Liu, Henglyu, et al.
Published: (2025)

MOSS Transcribe Diarize Technical Report
by: MOSI.AI, et al.
Published: (2026)

Learning Time-Graph Frequency Representation for Monaural Speech Enhancement
by: Wang, Tingting, et al.
Published: (2025)

DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
by: Wang, Yuanyuan, et al.
Published: (2025)

Speech Recognition Transformers: Topological-lingualism Perspective
by: Singh, Shruti, et al.
Published: (2024)

SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models
by: Wang, Linqin, et al.
Published: (2024)

Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning
by: Sun, Siqi, et al.
Published: (2024)

Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens
by: Zhao, Jinzheng, et al.
Published: (2024)

Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment
by: Shao, Yiwen, et al.
Published: (2024)

Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
by: Zhou, Xuanru, et al.
Published: (2026)

MUSA: Multi-lingual Speaker Anonymization via Serial Disentanglement
by: Yao, Jixun, et al.
Published: (2024)

MSR-Codec: A Low-Bitrate Multi-Stream Residual Codec for High-Fidelity Speech Generation with Information Disentanglement
by: Li, Jingyu, et al.
Published: (2025)

Entropy-based Coarse and Compressed Semantic Speech Representation Learning
by: Zuo, Jialong, et al.
Published: (2025)

Exploring Cross-Utterance Speech Contexts for Conformer-Transducer Speech Recognition Systems
by: Cui, Mingyu, et al.
Published: (2025)

Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech
by: Kim, Youngjae, et al.
Published: (2024)