:: Library Catalog

Bibliographic Details
Main Authors:	Neelam, Sanjit; Heinlein, Daniel; Cvicek, Vaclav; Mishra, Akshay; Pope, Reiner
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2504.06419

Similar Items

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
by: Timor, Nadav, et al.
Published: (2025)

Out-of-Vocabulary Sampling Boosts Speculative Decoding
by: Timor, Nadav, et al.
Published: (2025)

SpecPipe: Accelerating Pipeline Parallelism-based LLM Inference with Speculative Decoding
by: Yin, Haofei, et al.
Published: (2025)

Fast Inference via Hierarchical Speculative Decoding
by: Mohri, Clara, et al.
Published: (2025)

Speculative Speculative Decoding
by: Kumar, Tanishq, et al.
Published: (2026)

Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple
by: Bozorgkhoo, Amirhossein, et al.
Published: (2026)

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
by: Wen, Zhuofan, et al.
Published: (2024)

Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement
by: Jeon, Wonseok, et al.
Published: (2024)

Lever: Speculative LLM Inference on Smartphones
by: Wang, Tuowei, et al.
Published: (2026)

Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding
by: Park, Jihoon, et al.
Published: (2025)

Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
by: Li, Ruixiao, et al.
Published: (2025)

Hydragen: High-Throughput LLM Inference with Shared Prefixes
by: Juravsky, Jordan, et al.
Published: (2024)

Decoding Speculative Decoding
by: Yan, Minghao, et al.
Published: (2024)

An Interpretable Latency Model for Speculative Decoding in LLM Serving
by: Kong, Linghao, et al.
Published: (2026)

Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation
by: Gui, Lujun, et al.
Published: (2024)

SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
by: Huang, Kaixuan, et al.
Published: (2024)

CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference
by: Zhou, Enyu, et al.
Published: (2025)

Collaborative Speculative Inference for Efficient LLM Inference Serving
by: Gao, Luyao, et al.
Published: (2025)

Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference
by: Zhou, Xuwen, et al.
Published: (2026)

Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding
by: Xiao, Bin, et al.
Published: (2024)

Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE
by: Huang, Haiduo, et al.
Published: (2025)

Cascade Speculative Drafting for Even Faster LLM Inference
by: Chen, Ziyi, et al.
Published: (2023)

FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
by: Bajpai, Divya Jyoti, et al.
Published: (2025)

Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
by: Zhao, Yilong, et al.
Published: (2025)

Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding
by: Ryu, Hyun, et al.
Published: (2024)

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
by: Elhoushi, Mostafa, et al.
Published: (2024)

Online Speculative Decoding
by: Liu, Xiaoxuan, et al.
Published: (2023)

Speculative Safety-Aware Decoding
by: Wang, Xuekang, et al.
Published: (2025)

Speculative Decoding Across Languages
by: Paudel, Nirajan, et al.
Published: (2026)

SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding
by: Zhang, Ziyi, et al.
Published: (2025)

Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
by: Dai, J. G., et al.
Published: (2025)

Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
by: Koh, Jungyeon, et al.
Published: (2025)

LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
by: Gond, Raja, et al.
Published: (2026)

Principled Coarse-Grained Acceptance for Speculative Decoding in Speech
by: Yanuka, Moran, et al.
Published: (2025)

Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices
by: Mesa, Alejandro Ruiz y, et al.
Published: (2026)

Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
by: Agrawal, Sudhanshu, et al.
Published: (2025)

AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration
by: McDanel, Bradley
Published: (2024)

Mixture of Attentions For Speculative Decoding
by: Zimmer, Matthieu, et al.
Published: (2024)

Scaling Speculative Decoding with Lookahead Reasoning
by: Fu, Yichao, et al.
Published: (2025)

CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
by: Ning, Zhiyuan, et al.
Published: (2025)