:: Library Catalog

Bibliographic Details
Main Authors:	Li, Yinghao Aaron; Jiang, Xilin; Tao, Fei; Niu, Cheng; Xu, Kaifeng; Song, Juntong; Mesgarani, Nima
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2507.14988

Similar Items

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
by: Li, Yinghao Aaron, et al.
Published: (2024)

DeCoR: Defy Knowledge Forgetting by Predicting Earlier Audio Codes
by: Jiang, Xilin, et al.
Published: (2023)

Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis
by: Jiang, Xilin, et al.
Published: (2024)

Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience
by: Jiang, Xilin, et al.
Published: (2024)

Dual-path Mamba: Short and Long-term Bidirectional Selective Structured State Space Models for Speech Separation
by: Jiang, Xilin, et al.
Published: (2024)

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
by: Li, Yinghao Aaron, et al.
Published: (2024)

MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow
by: Shimizu, Riki, et al.
Published: (2025)

Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations
by: He, Linyang, et al.
Published: (2025)

DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
by: Li, Yinghao Aaron, et al.
Published: (2024)

SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
by: Shams, Siavash, et al.
Published: (2024)

Exploring Finetuned Audio-LLM on Heart Murmur Features
by: Florea, Adrian, et al.
Published: (2025)

Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue
by: Wu, Junkai, et al.
Published: (2024)

Bridging Ears and Eyes: Analyzing Audio and Visual Large Language Models to Humans in Visible Sound Recognition and Reducing Their Sensory Gap via Cross-Modal Distillation
by: Jiang, Xilin, et al.
Published: (2025)

SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models
by: Wang, Qiaolin, et al.
Published: (2025)

Interpretable Embeddings of Speech Enhance and Explain Brain Encoding Performance of Audio Models
by: Shimizu, Riki, et al.
Published: (2025)

DeepSpeech models show Human-like Performance and Processing of Cochlear Implant Inputs
by: Steinhardt, Cynthia R., et al.
Published: (2024)

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems
by: Eskimez, Sefik Emre, et al.
Published: (2024)

FNH-TTS: Mixture-of-Experts Duration Modeling for Robust Neural Speech Synthesis
by: Meng, Qingliang, et al.
Published: (2025)

Neuro2Semantic: A Transfer Learning Framework for Semantic Reconstruction of Continuous Language from Human Intracranial EEG
by: Shams, Siavash, et al.
Published: (2025)

Enhancing In-the-Wild Speech Emotion Conversion with Resynthesis-based Duration Modeling
by: Prabhu, Navin Raj, et al.
Published: (2025)

DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis
by: Gu, Yu, et al.
Published: (2024)

AS-Speech: Adaptive Style For Speech Synthesis
by: Li, Zhipeng, et al.
Published: (2024)

Rate-Aware Learned Speech Compression
by: Xu, Jun, et al.
Published: (2025)

DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing
by: Sahipjohn, Neha, et al.
Published: (2024)

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation
by: Liu, Wenrui, et al.
Published: (2025)

Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
by: Jiang, Yuepeng, et al.
Published: (2024)

A Multi-Stage Framework for Multimodal Controllable Speech Synthesis
by: Niu, Rui, et al.
Published: (2025)

Adaptive Duration Model for Text Speech Alignment
by: Cao, Junjie
Published: (2025)

AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
by: Jiang, Xilin, et al.
Published: (2025)

The Overview of Segmental Durations Modification Algorithms on Speech Signal Characteristics
by: Jang, Kyeomeun, et al.
Published: (2025)

Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms
by: Zhang, Chu Yuan, et al.
Published: (2023)

FlowSE-GRPO: Training Flow Matching Speech Enhancement via Online Reinforcement Learning
by: Wang, Haoxu, et al.
Published: (2026)

Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis
by: Inoue, Sho, et al.
Published: (2024)

ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis
by: Tao, Dehua, et al.
Published: (2024)

Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis
by: Niu, Zhikang, et al.
Published: (2025)

ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation
by: Fu, Ruibo, et al.
Published: (2024)

A Neural Speech Codec for Noise Robust Speech Coding
by: Huang, Jiayi, et al.
Published: (2023)

ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
by: Shi, Jiatong, et al.
Published: (2025)

MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
by: Yang, Qian, et al.
Published: (2024)

A Hybrid Discriminative and Generative System for Universal Speech Enhancement
by: Liu, Yinghao, et al.
Published: (2026)