Affichage MARC: :: Library Catalog

Enregistré dans:

Détails bibliographiques
Auteurs principaux:	Wu, Yangchao, Qin, Zongyue, Wong, Alex, Soatto, Stefano
Format:	Preprint
Publié:	2025
Sujets:	Machine Learning Artificial Intelligence
Accès en ligne:	https://arxiv.org/abs/2505.14969
Tags:	Ajouter un tag Pas de tags, Soyez le premier à ajouter un tag!

_version_	1866917046807166976
author	Wu, Yangchao Qin, Zongyue Wong, Alex Soatto, Stefano
author_facet	Wu, Yangchao Qin, Zongyue Wong, Alex Soatto, Stefano
contents	Speculative decoding is a technique to leverage hardware concurrency in order to enable multiple steps of token generation in a single forward pass, thus improving the efficiency of large-scale autoregressive (AR) Transformer models. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in the sliding window context. However, their state can also comprise thousands of tokens; so, speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage the tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead relative to current SSM implementations. Along with the algorithm, we describe a hardware-aware implementation that improves naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speed up with SSM and hybrid model inference. Code can be found at: https://github.com/wyc1997/stree.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_14969
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	STree: Speculative Tree Decoding for Hybrid State-Space Models Wu, Yangchao Qin, Zongyue Wong, Alex Soatto, Stefano Machine Learning Artificial Intelligence Speculative decoding is a technique to leverage hardware concurrency in order to enable multiple steps of token generation in a single forward pass, thus improving the efficiency of large-scale autoregressive (AR) Transformer models. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in the sliding window context. However, their state can also comprise thousands of tokens; so, speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage the tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead relative to current SSM implementations. Along with the algorithm, we describe a hardware-aware implementation that improves naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speed up with SSM and hybrid model inference. Code can be found at: https://github.com/wyc1997/stree.
title	STree: Speculative Tree Decoding for Hybrid State-Space Models
topic	Machine Learning Artificial Intelligence
url	https://arxiv.org/abs/2505.14969

Documents similaires