:: Library Catalog

Bibliographic Details
Main Authors:	Ma, Jianbo; Cartwright, Richard
Format:	Preprint
Published:	2026
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2604.19330

Similar Items

A low latency attention module for streaming self-supervised speech representation learning
by: Ma, Jianbo, et al.
Published: (2023)

Prosodic Parameter Manipulation in TTS generated speech for Controlled Speech Generation
by: Chary, Podakanti Satyajith
Published: (2024)

Adversarial speech for voice privacy protection from Personalized Speech generation
by: Chen, Shihao, et al.
Published: (2024)

Rethinking Mamba in Speech Processing by Self-Supervised Models
by: Zhang, Xiangyu, et al.
Published: (2024)

FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech
by: Ma, Linhan, et al.
Published: (2025)

Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)

SLM-S2ST: A multimodal language model for direct speech-to-speech translation
by: Hu, Yuxuan, et al.
Published: (2025)

Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment
by: Liu, Yunyi, et al.
Published: (2025)

Joint decoding method for controllable contextual speech recognition based on Speech LLM
by: Fang, Yangui, et al.
Published: (2025)

Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data
by: Shirahata, Yuma, et al.
Published: (2024)

Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study
by: Zhang, Chong, et al.
Published: (2024)

Robust fine-tuning of speech recognition models via model merging: application to disordered speech
by: Ducorroy, Alexandre, et al.
Published: (2025)

Compositional Audio Representation Learning
by: Sridhar, Sripathi, et al.
Published: (2024)

A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech
by: Liu, Oli Danyi, et al.
Published: (2024)

BFA: Real-time Multilingual Text-to-speech Forced Alignment
by: Rehman, Abdul, et al.
Published: (2025)

Phoneme-based speech recognition driven by large language models and sampling marginalization
by: Ma, Te, et al.
Published: (2025)

TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality
by: Feng, Tiantian, et al.
Published: (2024)

Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts
by: Kuhlmann, Michael, et al.
Published: (2026)

Advancing Electrolaryngeal Speech Enhancement Through Speech-Text Representation Learning
by: Ma, Ding, et al.
Published: (2026)

A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model
by: Zhao, Dongdi, et al.
Published: (2024)

Speech Codec Probing from Semantic and Phonetic Perspectives
by: Shi, Xuan, et al.
Published: (2026)

Lightweight speech enhancement guided target speech extraction in noisy multi-speaker scenarios
by: Huang, Ziling, et al.
Published: (2025)

Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities
by: Saon, George, et al.
Published: (2025)

Explainable speech emotion recognition through attentive pooling: insights from attention-based temporal localization
by: Leygue, Tahitoa, et al.
Published: (2025)

TASU: Text-Only Alignment for Speech Understanding
by: Peng, Jing, et al.
Published: (2025)

Position: Towards Responsible Evaluation for Text-to-Speech
by: Yang, Yifan, et al.
Published: (2025)

MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis
by: An, Keyu, et al.
Published: (2025)

DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation
by: Wang, Jianzong, et al.
Published: (2023)

Assessing speech quality metrics for evaluation of neural audio codecs under clean speech conditions
by: Mack, Wolfgang, et al.
Published: (2025)

Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition
by: Zhang, Yiru, et al.
Published: (2025)

An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech
by: Deng, Qingkun, et al.
Published: (2024)

Text-guided HuBERT: Self-Supervised Speech Pre-training via Generative Adversarial Networks
by: Ma, Duo, et al.
Published: (2024)

Probing mental health information in speech foundation models
by: de Gennes, Marc, et al.
Published: (2024)

WhisperFlow: speech foundation models in real time
by: Wang, Rongxiang, et al.
Published: (2024)

Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis
by: Hu, Yifan, et al.
Published: (2025)

Predicting speech intelligibility in older adults for speech enhancement using the Gammachirp Envelope Similarity Index, GESI
by: Yamamoto, Ayako, et al.
Published: (2025)

Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
by: Niu, Xinlei, et al.
Published: (2025)

TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet
by: Jeong, Jaeseok, et al.
Published: (2025)

Debatts: Zero-Shot Debating Text-to-Speech Synthesis
by: Huang, Yiqiao, et al.
Published: (2024)

I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception
by: Zhang, Jiawei, et al.
Published: (2024)