Staff View: FASST: Fast LLM-based Simultaneous Speech Translation :: Library Catalog

Bibliographic Details
Main Authors:	Ouyang, Siqi; Xu, Xi; Dandekar, Chinmay; Li, Lei
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2408.09430

_version_	1866916361292218368
author	Ouyang, Siqi; Xu, Xi; Dandekar, Chinmay; Li, Lei
author_facet	Ouyang, Siqi; Xu, Xi; Dandekar, Chinmay; Li, Lei
contents	Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. Existing methods either have high latency due to recomputation of input representations, or fall behind offline ST in translation quality. In this paper, we propose FASST, a fast large language model based method for streaming speech translation. We propose blockwise-causal speech encoding and a consistency mask, so that streaming speech input can be encoded incrementally without recomputation. Furthermore, we develop a two-stage training strategy to optimize FASST for simultaneous inference. We evaluate FASST and multiple strong prior models on the MuST-C dataset. Experimental results show that FASST achieves the best quality-latency trade-off. It outperforms the previous best model by an average of 1.5 BLEU under the same latency for English to Spanish translation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_09430
institution	arXiv
publishDate	2024
record_format	arxiv
title	FASST: Fast LLM-based Simultaneous Speech Translation
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2408.09430

Similar Items