| Main Authors: | Zhu, Han, Kang, Wei, Guo, Liyong, Yao, Zengwei, Kuang, Fangjun, Zhuang, Weiji, Li, Zhaoqing, Han, Zhifeng, Zhang, Dong, Zhang, Xin, Song, Xingchen, Ye, Lingxuan, Lin, Long, Povey, Daniel |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.09318 |
Similar Items
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
by: Zhu, Han, et al.
Published: (2025)
Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation
by: Yao, Zengwei, et al.
Published: (2025)
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
by: Zhu, Han, et al.
Published: (2026)
CR-CTC: Consistency regularization on CTC for improved speech recognition
by: Yao, Zengwei, et al.
Published: (2024)
Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
by: Kang, Wei, et al.
Published: (2023)
PromptASR for contextualized ASR with controllable style
by: Yang, Xiaoyu, et al.
Published: (2023)
Zipformer: A faster and better encoder for automatic speech recognition
by: Yao, Zengwei, et al.
Published: (2023)
k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning
by: Yang, Yifan, et al.
Published: (2024)
LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization
by: Jin, Zengrui, et al.
Published: (2024)
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis
by: Wang, Huimeng, et al.
Published: (2026)
Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization
by: Lu, Yen-Ju, et al.
Published: (2025)
UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models
by: Tu, Wenming, et al.
Published: (2025)
Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching
by: Chen, Zhengyang, et al.
Published: (2024)
UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models
by: Guan, Wenhao, et al.
Published: (2025)
Retrieval Augmented End-to-End Spoken Dialog Models
by: Wang, Mingqiu, et al.
Published: (2024)
Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue
by: Lin, Guan-Ting, et al.
Published: (2023)
LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems
by: Zhang, Hao, et al.
Published: (2025)
FeruzaSpeech: A 60 Hour Uzbek Read Speech Corpus with Punctuation, Casing, and Context
by: Povey, Anna, et al.
Published: (2024)
CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching
by: Zhang, Leying, et al.
Published: (2025)
J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling
by: Nakata, Wataru, et al.
Published: (2024)
E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models
by: Xue, Hongfei, et al.
Published: (2023)
FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning
by: Chen, Tanyu, et al.
Published: (2026)
Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech
by: Czyżnikiewicz, Mateusz, et al.
Published: (2024)
LatentVoiceGrad: Nonparallel Voice Conversion with Latent Diffusion/Flow-Matching Models
by: Kameoka, Hirokazu, et al.
Published: (2025)
Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model
by: Zuo, Jialong, et al.
Published: (2025)
SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation
by: Lu, Haitian, et al.
Published: (2025)
Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems
by: Zink, Oswald, et al.
Published: (2024)
VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
by: Zhan, Jun, et al.
Published: (2025)
DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
by: Xie, Hanke, et al.
Published: (2025)
WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching
by: Luo, Tianze, et al.
Published: (2025)
VoiceRestore: Flow-Matching Transformers for Speech Recording Quality Restoration
by: Kirdey, Stanislav
Published: (2025)
Towards a Japanese Full-duplex Spoken Dialogue System
by: Ohashi, Atsumoto, et al.
Published: (2025)
Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model
by: Xia, Kangxiang, et al.
Published: (2026)
An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems
by: Tulsiani, Hitesh, et al.
Published: (2024)
DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset
by: Koudounas, Alkis, et al.
Published: (2025)
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
by: Li, Ruiqi, et al.
Published: (2026)
Audio Dialogues: Dialogues dataset for audio and music understanding
by: Goel, Arushi, et al.
Published: (2024)
Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training
by: Udupa, Sathvik, et al.
Published: (2025)
EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems
by: Liu, Jingwen, et al.
Published: (2025)
FlowSE: Flow Matching-based Speech Enhancement
by: Lee, Seonggyu, et al.
Published: (2025)