Staff View: Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff :: Library Catalog

Bibliographic Details
Main Authors:	Holsman, Maximilian; Huang, Yukun; Dhingra, Bhuwan
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.20704

_version_	1866909634386722816
author	Holsman, Maximilian; Huang, Yukun; Dhingra, Bhuwan
author_facet	Holsman, Maximilian; Huang, Yukun; Dhingra, Bhuwan
contents	Speculative Decoding (SD) enforces strict distributional equivalence to the target model when accepting candidate tokens. While it maintains the target model's generation quality, this strict equivalence limits the speedup achievable by SD and prevents users from trading deviations from the target distribution in exchange for further inference speed gains. To address these limitations, we introduce Fuzzy Speculative Decoding (FSD) - a decoding algorithm that generalizes SD by accepting candidate tokens based on the divergences between the target and draft model distributions. By allowing for controlled divergence from the target model, FSD enables users to flexibly trade generation quality for inference speed. Across several benchmarks, our method is able to achieve significant runtime improvements of over 5 tokens per second faster than SD at only an approximate 2% absolute reduction in benchmark accuracy. In many cases, FSD is even able to match SD benchmark accuracy at over 2 tokens per second faster, demonstrating that distributional equivalence is not necessary to maintain target model performance. Furthermore, FSD can be seamlessly integrated into existing SD extensions; we demonstrate this by applying FSD to EAGLE-2, greatly enhancing this existing extension's efficiency while allowing it to leverage FSD's tunable quality-speed trade-off.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_20704
institution	arXiv
publishDate	2025
record_format	arxiv
title	Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff
topic	Artificial Intelligence
url	https://arxiv.org/abs/2502.20704
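
Note on the abstract: the acceptance rule it describes (accepting candidate tokens based on the divergence between the target and draft distributions) might look roughly like the sketch below. This is an illustration only, not the authors' implementation. The choice of Jensen-Shannon divergence, the fallback to the standard SD acceptance test, and the names `js_divergence`, `sd_accept`, `fuzzy_accept`, and `threshold` are all assumptions made for this example and are not stated in the record.

```python
import math
import random


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions.

    Assumption: the abstract only says "divergences"; JS is one possible choice.
    """
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0 here.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def sd_accept(token, target, draft, rng):
    """Standard SD test: accept with probability min(1, p_target / p_draft)."""
    return rng.random() < min(1.0, target[token] / draft[token])


def fuzzy_accept(token, target, draft, threshold, rng):
    """Hypothetical fuzzy rule: accept outright when the distributions are close,
    otherwise fall back to the standard SD test."""
    if js_divergence(target, draft) <= threshold:
        return True
    return sd_accept(token, target, draft, rng)


if __name__ == "__main__":
    target = [0.7, 0.2, 0.1]  # toy target-model distribution over 3 tokens
    draft = [0.6, 0.3, 0.1]   # toy draft-model distribution
    rng = random.Random(0)
    print(fuzzy_accept(1, target, draft, threshold=0.05, rng=rng))
```

In this sketch a larger `threshold` accepts more draft tokens (faster, further from the target distribution), while `threshold=0` only short-circuits when the two distributions are identical, where the standard test would accept anyway.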