| Main Authors: | Pan, Jing; Wu, Jian; Gaur, Yashesh; Sivasankaran, Sunit; Chen, Zhuo; Liu, Shujie; Li, Jinyu |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | Computation and Language; Artificial Intelligence; Audio and Speech Processing |
| Online Access: | https://arxiv.org/abs/2311.02248 |
| _version_ | 1866909223744438272 |
|---|---|
| author | Pan, Jing; Wu, Jian; Gaur, Yashesh; Sivasankaran, Sunit; Chen, Zhuo; Liu, Shujie; Li, Jinyu |
| author_facet | Pan, Jing; Wu, Jian; Gaur, Yashesh; Sivasankaran, Sunit; Chen, Zhuo; Liu, Shujie; Li, Jinyu |
| contents | We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA) pairs from speech transcriptions for supervised instruction tuning. With under 30 million trainable parameters and only 450 hours of English speech data, COSMIC demonstrates emerging capabilities in instruction-following and in-context learning. Equipped with such capabilities, COSMIC achieves a maximum 33.18 BLEU score in 0-shot EN-to-X speech-to-text translation (S2TT) and a significant boost in the 1-shot setting. Additionally, there is an average 25.8% relative Word Error Rate (WER) reduction for 1-shot cross-domain adaptation. COSMIC exhibits a significant automatic speech recognition (ASR) accuracy gain in contextual biasing tasks due to its instruction-following capability. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2311_02248 |
| institution | arXiv |
| publishDate | 2023 |
| record_format | arxiv |
| title | COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning |
| topic | Computation and Language; Artificial Intelligence; Audio and Speech Processing |
| url | https://arxiv.org/abs/2311.02248 |