Staff View: Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Hongxuan; Liu, Zhining; Zhao, Yao; Zheng, Jiaqi; Zhuang, Chenyi; Gu, Jinjie; Chen, Guihai
Format:	Preprint
Published:	2023
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2311.08263

_version_	1866910470128009216
author	Zhang, Hongxuan; Liu, Zhining; Zhao, Yao; Zheng, Jiaqi; Zhuang, Chenyi; Gu, Jinjie; Chen, Guihai
author_facet	Zhang, Hongxuan; Liu, Zhining; Zhao, Yao; Zheng, Jiaqi; Zhuang, Chenyi; Gu, Jinjie; Chen, Guihai
contents	In this work, we propose FastCoT, a model-agnostic framework based on parallel decoding that requires neither further training of an auxiliary model nor modification of the LLM itself. FastCoT uses a context window whose size varies with position to conduct parallel decoding and autoregressive decoding simultaneously, thus fully utilizing GPU computation resources. In FastCoT, the parallel decoding part gives the LLM a quick glance of the future composed of approximate tokens, which can lead to faster answers than the regular autoregressive decoding used by causal transformers. We also provide an implementation of parallel decoding within the LLM that supports KV-cache generation and batch processing. Through extensive experiments, we demonstrate that FastCoT reduces inference time by nearly 20% with only a negligible performance drop compared to the regular approach. Additionally, we show that the context window size is considerably robust across different tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2311_08263
institution	arXiv
publishDate	2023
record_format	arxiv
title	Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
topic	Computation and Language
url	https://arxiv.org/abs/2311.08263
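
Illustrative note on the abstract above: the following is a minimal sketch of generic Jacobi-style parallel decoding, the family of techniques the abstract refers to. It is not the authors' FastCoT implementation. The toy model `toy_next_token`, the function names, and the fixed `window` parameter are all assumptions made for illustration; the position-dependent window size, KV-cache, and batching mentioned in the abstract are not modeled here.

```python
from typing import Callable, List, Tuple

Step = Callable[[List[int]], int]


def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical stand-in for one greedy LLM step (deterministic in the prefix)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def autoregressive_decode(prompt: List[int], n_new: int,
                          step: Step = toy_next_token) -> List[int]:
    """Baseline: one confirmed token per model call."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(step(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt: List[int], n_new: int, window: int = 4,
                  step: Step = toy_next_token) -> Tuple[List[int], int]:
    """Jacobi-style decoding: refine `window` guessed future tokens per pass."""
    confirmed = list(prompt)
    guesses = [0] * window  # arbitrary initial "glance" of the future
    iterations = 0
    while len(confirmed) - len(prompt) < n_new:
        iterations += 1
        # One pass: position i is predicted from the confirmed tokens plus the
        # current guesses before it. A real model would batch these predictions
        # into a single forward pass; here they are computed in a plain loop.
        new = [step(confirmed + guesses[:i]) for i in range(window)]
        # new[0] depends only on confirmed tokens, so it is always exact.
        # new[i] is exact when every earlier guess matched its new prediction.
        accepted = 1
        while accepted < window and guesses[accepted - 1] == new[accepted - 1]:
            accepted += 1
        confirmed.extend(new[:accepted])
        # Unaccepted predictions become the approximate guesses for the next pass.
        guesses = new[accepted:] + [0] * accepted
    return confirmed[len(prompt):len(prompt) + n_new], iterations


if __name__ == "__main__":
    prompt = [3, 1, 4]
    print(autoregressive_decode(prompt, 12))
    print(jacobi_decode(prompt, 12, window=4))
```

In this sketch every pass confirms at least one token, so the number of passes never exceeds the number of autoregressive steps, and the output is identical to greedy autoregressive decoding. Any speedup depends on how often the guessed tokens turn out to be correct.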

Similar Items