Bibliographic Details
Main Authors: Lu, Xiangyu; Xu, Wang; Wang, Haoyu; Zhou, Hongyun; Zhao, Haiyan; Zhu, Conghui; Zhao, Tiejun; Yang, Muyun
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2502.11123
_version_ 1866909562999668736
author Lu, Xiangyu
Xu, Wang
Wang, Haoyu
Zhou, Hongyun
Zhao, Haiyan
Zhu, Conghui
Zhao, Tiejun
Yang, Muyun
contents Real-time speech conversation is essential for natural and efficient human-machine interactions, requiring duplex and streaming capabilities. Traditional Transformer-based conversational chatbots operate in a turn-based manner and incur computational cost that grows quadratically with input length. In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. DuplexMamba enables simultaneous input processing and output generation, dynamically adjusting to support real-time streaming. Specifically, we develop a Mamba-based speech encoder and adapt it with a Mamba-based language model. Furthermore, we introduce a novel duplex decoding strategy that enables DuplexMamba to process input and generate output simultaneously. Experimental results demonstrate that DuplexMamba successfully implements duplex and streaming capabilities while achieving performance comparable to several recently developed Transformer-based models in automatic speech recognition (ASR) tasks and voice assistant benchmark evaluations. Our code and model are released.
format Preprint
id arxiv_https___arxiv_org_abs_2502_11123
institution arXiv
publishDate 2025
record_format arxiv
title DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities
topic Computation and Language
url https://arxiv.org/abs/2502.11123