Bibliographic Details
Main Authors: Ouyang, Siqi; Xu, Xi; Dandekar, Chinmay; Li, Lei
Format: Preprint
Published: 2024
Subjects: Computation and Language; Artificial Intelligence
Online Access: https://arxiv.org/abs/2408.09430
_version_ 1866916361292218368
author Ouyang, Siqi
Xu, Xi
Dandekar, Chinmay
Li, Lei
contents Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. Existing methods either have high latency due to recomputation of input representations or fall behind offline ST in translation quality. In this paper, we propose FASST, a fast large-language-model-based method for streaming speech translation. We propose blockwise-causal speech encoding and a consistency mask, so that streaming speech input can be encoded incrementally without recomputation. Furthermore, we develop a two-stage training strategy to optimize FASST for simultaneous inference. We evaluate FASST and multiple strong prior models on the MuST-C dataset. Experimental results show that FASST achieves the best quality-latency trade-off. It outperforms the previous best model by an average of 1.5 BLEU under the same latency for English-to-Spanish translation.
format Preprint
id arxiv_https___arxiv_org_abs_2408_09430
institution arXiv
publishDate 2024
record_format arxiv
title FASST: Fast LLM-based Simultaneous Speech Translation
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2408.09430