Staff View: Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

Bibliographic Details
Main Authors:	Ankner, Zachary; Parthasarathy, Rishab; Nrusimha, Aniruddha; Rinard, Christopher; Ragan-Kelley, Jonathan; Brandon, William
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2402.05109

_version_	1866914966111518720
author	Ankner, Zachary; Parthasarathy, Rishab; Nrusimha, Aniruddha; Rinard, Christopher; Ragan-Kelley, Jonathan; Brandon, William
author_facet	Ankner, Zachary; Parthasarathy, Rishab; Nrusimha, Aniruddha; Rinard, Christopher; Ragan-Kelley, Jonathan; Brandon, William
contents	To combat the memory bandwidth-bound nature of autoregressive LLM inference, previous research has proposed the speculative decoding framework. To perform speculative decoding, a small draft model proposes candidate continuations of the input sequence that are then verified in parallel by the base model. One way to specify the draft model, as used in the recent Medusa decoding framework, is as a collection of lightweight heads, called draft heads, that operate on the base model's hidden states. To date, all existing draft heads have been sequentially independent, meaning that they speculate tokens in the candidate continuation independently of any preceding tokens in the candidate continuation. In this work, we propose Hydra heads: a sequentially-dependent drop-in replacement for standard draft heads that significantly improves the accuracy of draft head speculation. We further explore the design space of Hydra head training objectives and architectures, and propose a carefully tuned Hydra head recipe, which we call Hydra++, that improves decoding throughput by up to 1.31x and 2.70x compared to Medusa decoding and autoregressive decoding respectively. Overall, Hydra heads are a simple and well-motivated intervention on standard draft heads that significantly improve the end-to-end speed of draft head-based speculative decoding. We make our code publicly available at https://github.com/zankner/Hydra.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_05109
institution	arXiv
publishDate	2024
record_format	arxiv
title	Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
topic	Machine Learning
url	https://arxiv.org/abs/2402.05109
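
Illustrative note: the contents field describes draft heads that propose several tokens which the base model then verifies, and contrasts sequentially independent heads with sequentially dependent ones. The toy sketch below is not the authors' code (that is at the GitHub URL in the record). It assumes a hypothetical integer "base model" (`base_next`), greedy acceptance, and drafting rules invented for illustration. Verification is written as a loop for readability, whereas the abstract describes it as a single parallel pass.

```python
from typing import List, Sequence, Tuple


def base_next(seq: Sequence[int]) -> int:
    """Hypothetical base model: next token is the sum of the last two, mod 10."""
    return (seq[-1] + seq[-2]) % 10


def draft_independent(prefix: Sequence[int], num_heads: int) -> List[int]:
    """Each head sees only the verified prefix, never the other drafted tokens."""
    return [base_next(prefix) for _ in range(num_heads)]


def draft_dependent(prefix: Sequence[int], num_heads: int) -> List[int]:
    """Each head also conditions on the tokens drafted by the heads before it."""
    seq = list(prefix)
    draft = []
    for _ in range(num_heads):
        tok = base_next(seq)  # toy head: reuses the base rule on prefix + draft so far
        draft.append(tok)
        seq.append(tok)
    return draft


def verify(prefix: Sequence[int], draft: Sequence[int]) -> Tuple[List[int], int]:
    """Greedy verification: keep drafted tokens until the first mismatch with the
    base model, then append one token chosen by the base model itself."""
    seq = list(prefix)
    accepted = 0
    for tok in draft:
        if base_next(seq) != tok:
            break
        seq.append(tok)
        accepted += 1
    seq.append(base_next(seq))
    return seq, accepted


prefix = [1, 2]
print(verify(prefix, draft_independent(prefix, 3)))
print(verify(prefix, draft_dependent(prefix, 3)))
```

In this toy the dependent heads are perfect only because they reuse the base rule, so the gap in accepted tokens is exaggerated. The sketch shows the difference in what each head conditions on and makes no claim about the speedups reported in the abstract.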

Similar Items