Table of Contents: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Chan, Brian J, Huang, MaoXun, Cheng, Jui-Hung, Chen, Chao-Ting, Huang, Hen-Hsen
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.00085
Tags:	Add Tag No Tags, Be the first to tag this record!

Table of Contents:

This work presents a novel trie (prefix-tree)-based parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache across beams with common prefixes, our approach dramatically reduces memory usage and enables efficient decoding. We evaluated our method across three attention architectures, Multi-Head Attention (Phi-3.5-mini-instruct), Grouped Query Attention (Llama-3.1-8B-Instruct), and Sliding Window Attention (Mistral-Small-24B-Instruct-2501), using CNN/DailyMail for abstractive summarization and HumanEval for code generation. Our experiments demonstrate substantial memory savings (4--8$\times$) and up to 2.4$\times$ faster decoding, without compromising generation quality. These results highlight our method's suitability for memory-constrained environments and large-scale deployments.

Similar Items