:: Library Catalog

Bibliographic Details
Main Authors:	Cao, Di; Fu, Dongjie; Yu, Hai; Zheng, Siqi; Tan, Xu; Jin, Tao
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2603.24596

Similar Items

TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation
by: Liu, Wei, et al.
Published: (2025)

Improving Speech Emotion Recognition Through Cross Modal Attention Alignment and Balanced Stacking Model
by: Ueda, Lucas, et al.
Published: (2025)

Dynamic Frequency-Adaptive Knowledge Distillation for Speech Enhancement
by: Yuan, Xihao, et al.
Published: (2025)

UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition
by: Fu, Li, et al.
Published: (2024)

TASU: Text-Only Alignment for Speech Understanding
by: Peng, Jing, et al.
Published: (2025)

SSR: Alignment-Aware Modality Connector for Speech Language Models
by: Tan, Weiting, et al.
Published: (2024)

Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection
by: Fan, Cunhang, et al.
Published: (2023)

ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis
by: Tao, Dehua, et al.
Published: (2024)

Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition
by: Ok, Seaone, et al.
Published: (2026)

Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
by: Wang, Zheng, et al.
Published: (2026)

Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
by: Choi, Jeongsoo, et al.
Published: (2025)

Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models
by: Tao, Dehua, et al.
Published: (2026)

Complex Recurrent Variational Autoencoder with Application to Speech Enhancement
by: Xie, Yuying, et al.
Published: (2022)

WhisperVC: Decoupled Cross-Domain Alignment and Speech Generation for Low-Resource Whisper-to-Normal Conversion
by: Liu, Dong, et al.
Published: (2025)

ARTT: Augmented Reverberant-Target Training for Unsupervised Monaural Speech Dereverberation
by: Song, Siqi, et al.
Published: (2026)

Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
by: Yuan, Ze, et al.
Published: (2024)

TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
by: Peng, Jing, et al.
Published: (2026)

Exploring the Capability of Mamba in Speech Applications
by: Miyazaki, Koichi, et al.
Published: (2024)

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing
by: Zhang, Hanlin, et al.
Published: (2026)

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
by: Liu, Wenrui, et al.
Published: (2025)

Multi-Distillation from Speech and Music Representation Models
by: Wei, Jui-Chiang, et al.
Published: (2025)

Adaptive Duration Model for Text Speech Alignment
by: Cao, Junjie
Published: (2025)

FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech
by: Ma, Linhan, et al.
Published: (2025)

Robust One-step Speech Enhancement via Consistency Distillation
by: Xu, Liang, et al.
Published: (2025)

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
by: Lu, Ke-Han, et al.
Published: (2024)

Distil-DCCRN: A Small-footprint DCCRN Leveraging Feature-based Knowledge Distillation in Speech Enhancement
by: Han, Runduo, et al.
Published: (2024)

AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling
by: Kalkhorani, Vahid Ahmadi, et al.
Published: (2024)

USpeech: Ultrasound-Enhanced Speech with Minimal Human Effort via Cross-Modal Synthesis
by: Yu, Luca Jiang-Tao, et al.
Published: (2024)

LLMs and Speech: Integration vs. Combination
by: Schmitt, Robin, et al.
Published: (2026)

Attention-weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied to Speech Emotion Recognition
by: Yang, Qingran, et al.
Published: (2026)

Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance
by: Luong, Diep, et al.
Published: (2025)

Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision
by: Chen, Yafeng, et al.
Published: (2024)

SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec
by: Qiang, Chunyu, et al.
Published: (2025)

Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs
by: Hsu, Ming-Hao, et al.
Published: (2026)

Group Relative Policy Optimization for Speech Recognition
by: Shivakumar, Prashanth Gurunath, et al.
Published: (2025)

CUEMPATHY: A Counseling Speech Dataset for Psychotherapy Research
by: Tao, Dehua, et al.
Published: (2024)

Text-aware Speech Separation for Multi-talker Keyword Spotting
by: Li, Haoyu, et al.
Published: (2024)

AS-Speech: Adaptive Style For Speech Synthesis
by: Li, Zhipeng, et al.
Published: (2024)

Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora
by: Xu, Jing, et al.
Published: (2024)

DISPATCH: Distilling Selective Patches for Speech Enhancement
by: Kim, Dohwan, et al.
Published: (2025)