Staff View: SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding :: Library Catalog

Bibliographic Details
Main Authors:	Abramovich, Talor; Ashkenazi, Maor; Putterman, Izzy; Chislett, Benjamin; Mitra, Tiyasa; Rouhani, Bita Darvish; Zilberstein, Ran; Geifman, Yonatan
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.09557

_version_	1866918529363607552
author	Abramovich, Talor; Ashkenazi, Maor; Putterman, Izzy; Chislett, Benjamin; Mitra, Tiyasa; Rouhani, Bita Darvish; Zilberstein, Ran; Geifman, Yonatan
author_facet	Abramovich, Talor; Ashkenazi, Maor; Putterman, Izzy; Chislett, Benjamin; Mitra, Tiyasa; Rouhani, Bita Darvish; Zilberstein, Ran; Geifman, Yonatan
contents	Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_09557
institution	arXiv
publishDate	2026
record_format	arxiv
title	SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
topic	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
url	https://arxiv.org/abs/2604.09557