| Main Authors: | Li, Yinghao Aaron, Jiang, Xilin, Tao, Fei, Niu, Cheng, Xu, Kaifeng, Song, Juntong, Mesgarani, Nima |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.14988 |
Similar Items
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
by: Li, Yinghao Aaron, et al.
Published: (2024)
DeCoR: Defy Knowledge Forgetting by Predicting Earlier Audio Codes
by: Jiang, Xilin, et al.
Published: (2023)
Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis
by: Jiang, Xilin, et al.
Published: (2024)
Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience
by: Jiang, Xilin, et al.
Published: (2024)
Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation
by: Jiang, Xilin, et al.
Published: (2024)
Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
by: Li, Yinghao Aaron, et al.
Published: (2024)
MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow
by: Shimizu, Riki, et al.
Published: (2025)
Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations
by: He, Linyang, et al.
Published: (2025)
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
by: Li, Yinghao Aaron, et al.
Published: (2024)
SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
by: Shams, Siavash, et al.
Published: (2024)
Exploring Finetuned Audio-LLM on Heart Murmur Features
by: Florea, Adrian, et al.
Published: (2025)
Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue
by: Wu, Junkai, et al.
Published: (2024)
Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
by: Jiang, Xilin, et al.
Published: (2025)
SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models
by: Wang, Qiaolin, et al.
Published: (2025)
Interpretable Embeddings of Speech Enhance and Explain Brain Encoding Performance of Audio Models
by: Shimizu, Riki, et al.
Published: (2025)
DeepSpeech models show Human-like Performance and Processing of Cochlear Implant Inputs
by: Steinhardt, Cynthia R., et al.
Published: (2024)
Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
by: Eskimez, Sefik Emre, et al.
Published: (2024)
FNH-TTS: Mixture-of-Experts Duration Modeling for Robust Neural Speech Synthesis
by: Meng, Qingliang, et al.
Published: (2025)
Neuro2Semantic: A Transfer Learning Framework for Semantic Reconstruction of Continuous Language from Human Intracranial EEG
by: Shams, Siavash, et al.
Published: (2025)
Enhancing In-the-Wild Speech Emotion Conversion with Resynthesis-based Duration Modeling
by: Prabhu, Navin Raj, et al.
Published: (2025)
DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis
by: Gu, Yu, et al.
Published: (2024)
AS-Speech: Adaptive Style For Speech Synthesis
by: Li, Zhipeng, et al.
Published: (2024)
Rate-Aware Learned Speech Compression
by: Xu, Jun, et al.
Published: (2025)
DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing
by: Sahipjohn, Neha, et al.
Published: (2024)
Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
by: Liu, Wenrui, et al.
Published: (2025)
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
by: Jiang, Yuepeng, et al.
Published: (2024)
A Multi-Stage Framework for Multimodal Controllable Speech Synthesis
by: Niu, Rui, et al.
Published: (2025)
Adaptive Duration Model for Text Speech Alignment
by: Cao, Junjie
Published: (2025)
AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
by: Jiang, Xilin, et al.
Published: (2025)
The Overview of Segmental Durations Modification Algorithms on Speech Signal Characteristics
by: Jang, Kyeomeun, et al.
Published: (2025)
Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms
by: Zhang, Chu Yuan, et al.
Published: (2023)
FlowSE-GRPO: Training Flow Matching Speech Enhancement via Online Reinforcement Learning
by: Wang, Haoxu, et al.
Published: (2026)
Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis
by: Inoue, Sho, et al.
Published: (2024)
ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis
by: Tao, Dehua, et al.
Published: (2024)
Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis
by: Niu, Zhikang, et al.
Published: (2025)
ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation
by: Fu, Ruibo, et al.
Published: (2024)
A Neural Speech Codec for Noise Robust Speech Coding
by: Huang, Jiayi, et al.
Published: (2023)
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)
MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
by: Yang, Qian, et al.
Published: (2024)
A Hybrid Discriminative and Generative System for Universal Speech Enhancement
by: Liu, Yinghao, et al.
Published: (2026)