Bibliographic Details
Main Authors: Pan, Jing; Wu, Jian; Gaur, Yashesh; Sivasankaran, Sunit; Chen, Zhuo; Liu, Shujie; Li, Jinyu
Format: Preprint
Published: 2023
Subjects: Computation and Language; Artificial Intelligence; Audio and Speech Processing
Online Access: https://arxiv.org/abs/2311.02248
author Pan, Jing
Wu, Jian
Gaur, Yashesh
Sivasankaran, Sunit
Chen, Zhuo
Liu, Shujie
Li, Jinyu
contents We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA) pairs from speech transcriptions for supervised instruction tuning. With under 30 million trainable parameters and only 450 hours of English speech data, COSMIC demonstrates emerging capabilities in instruction-following and in-context learning. Equipped with such capabilities, COSMIC achieves a maximum BLEU score of 33.18 in 0-shot EN-to-X speech-to-text translation (S2TT) and a significant boost in the 1-shot setting. Additionally, there is an average 25.8% relative Word Error Rate (WER) reduction for 1-shot cross-domain adaptation. COSMIC exhibits a significant automatic speech recognition (ASR) accuracy gain in contextual biasing tasks due to its instruction-following capability.
format Preprint
id arxiv_https___arxiv_org_abs_2311_02248
institution arXiv
publishDate 2023
record_format arxiv
title COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning
topic Computation and Language
Artificial Intelligence
Audio and Speech Processing
url https://arxiv.org/abs/2311.02248