| Main Authors: | Ma, Jianbo, Cartwright, Richard |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.19330 |
Similar Items
A low latency attention module for streaming self-supervised speech representation learning
by: Ma, Jianbo, et al.
Published: (2023)
Prosodic Parameter Manipulation in TTS generated speech for Controlled Speech Generation
by: Chary, Podakanti Satyajith
Published: (2024)
Adversarial speech for voice privacy protection from Personalized Speech generation
by: Chen, Shihao, et al.
Published: (2024)
Rethinking Mamba in Speech Processing by Self-Supervised Models
by: Zhang, Xiangyu, et al.
Published: (2024)
FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech
by: Ma, Linhan, et al.
Published: (2025)
Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis
by: Yang, Yifan, et al.
Published: (2024)
SLM-S2ST: A multimodal language model for direct speech-to-speech translation
by: Hu, Yuxuan, et al.
Published: (2025)
Text2Move: Text-to-moving sound generation via trajectory prediction and temporal alignment
by: Liu, Yunyi, et al.
Published: (2025)
Joint decoding method for controllable contextual speech recognition based on Speech LLM
by: Fang, Yangui, et al.
Published: (2025)
Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data
by: Shirahata, Yuma, et al.
Published: (2024)
Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study
by: Zhang, Chong, et al.
Published: (2024)
Robust fine-tuning of speech recognition models via model merging: application to disordered speech
by: Ducorroy, Alexandre, et al.
Published: (2025)
Compositional Audio Representation Learning
by: Sridhar, Sripathi, et al.
Published: (2024)
A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech
by: Liu, Oli Danyi, et al.
Published: (2024)
BFA: Real-time Multilingual Text-to-speech Forced Alignment
by: Rehman, Abdul, et al.
Published: (2025)
Phoneme-based speech recognition driven by large language models and sampling marginalization
by: Ma, Te, et al.
Published: (2025)
TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality
by: Feng, Tiantian, et al.
Published: (2024)
Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts
by: Kuhlmann, Michael, et al.
Published: (2026)
Advancing Electrolaryngeal Speech Enhancement Through Speech-Text Representation Learning
by: Ma, Ding, et al.
Published: (2026)
A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model
by: Zhao, Dongdi, et al.
Published: (2024)
Speech Codec Probing from Semantic and Phonetic Perspectives
by: Shi, Xuan, et al.
Published: (2026)
Lightweight speech enhancement guided target speech extraction in noisy multi-speaker scenarios
by: Huang, Ziling, et al.
Published: (2025)
Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities
by: Saon, George, et al.
Published: (2025)
Explainable speech emotion recognition through attentive pooling: insights from attention-based temporal localization
by: Leygue, Tahitoa, et al.
Published: (2025)
TASU: Text-Only Alignment for Speech Understanding
by: Peng, Jing, et al.
Published: (2025)
Position: Towards Responsible Evaluation for Text-to-Speech
by: Yang, Yifan, et al.
Published: (2025)
MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis
by: An, Keyu, et al.
Published: (2025)
DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation
by: Wang, Jianzong, et al.
Published: (2023)
Assessing speech quality metrics for evaluation of neural audio codecs under clean speech conditions
by: Mack, Wolfgang, et al.
Published: (2025)
Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition
by: Zhang, Yiru, et al.
Published: (2025)
An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech
by: Deng, Qingkun, et al.
Published: (2024)
Text-guided HuBERT: Self-Supervised Speech Pre-training via Generative Adversarial Networks
by: Ma, Duo, et al.
Published: (2024)
Probing mental health information in speech foundation models
by: de Gennes, Marc, et al.
Published: (2024)
WhisperFlow: speech foundation models in real time
by: Wang, Rongxiang, et al.
Published: (2024)
Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis
by: Hu, Yifan, et al.
Published: (2025)
Predicting speech intelligibility in older adults for speech enhancement using the Gammachirp Envelope Similarity Index, GESI
by: Yamamoto, Ayako, et al.
Published: (2025)
Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
by: Niu, Xinlei, et al.
Published: (2025)
TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet
by: Jeong, Jaeseok, et al.
Published: (2025)
Debatts: Zero-Shot Debating Text-to-Speech Synthesis
by: Huang, Yiqiao, et al.
Published: (2024)
I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception
by: Zhang, Jiawei, et al.
Published: (2024)