Library Catalog

Bibliographic Details
Main Authors:	Chen, Ziyi; Yang, Xiaocong; Lin, Jiacheng; Sun, Chenkai; Chang, Kevin Chen-Chuan; Huang, Jie
Format:	Preprint
Published:	2023
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2312.11462

Similar Items

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
by: Wen, Zhuofan, et al.
Published: (2024)

SPECS: Faster Test-Time Scaling through Speculative Drafts
by: Cemri, Mert, et al.
Published: (2025)

Faster Cascades via Speculative Decoding
by: Narasimhan, Harikrishna, et al.
Published: (2024)

Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
by: Shoham, Ofir Ben
Published: (2026)

PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models
by: Wang, Xuliang, et al.
Published: (2026)

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
by: Li, Yuhui, et al.
Published: (2024)

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
by: Ramakrishnan, Ramchalam Kinattinkara, et al.
Published: (2025)

POSS: Position Specialist Generates Better Draft for Speculative Decoding
by: Huang, Langlin, et al.
Published: (2025)

TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs
by: Lee, Minjae, et al.
Published: (2026)

Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference
by: Timor, Nadav, et al.
Published: (2024)

Faster LLM Inference via Sequential Monte Carlo
by: Emara, Yahya, et al.
Published: (2026)

Faster MoE LLM Inference for Extremely Large Models
by: Yang, Haoqi, et al.
Published: (2025)

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
by: Plaksin, Anton, et al.
Published: (2026)

Make Every Draft Count: Hidden State based Speculative Decoding
by: Chen, Yuetao, et al.
Published: (2026)

Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
by: Li, Ruixiao, et al.
Published: (2025)

LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
by: Yang, Penghui, et al.
Published: (2025)

ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
by: Georganas, Evangelos, et al.
Published: (2025)

Training Domain Draft Models for Speculative Decoding: Best Practices and Insights
by: Hong, Fenglu, et al.
Published: (2025)

Self-Speculative Biased Decoding for Faster Re-Translation
by: Zeng, Linxiao, et al.
Published: (2025)

Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
by: Goel, Raghavv, et al.
Published: (2024)

MineDraft: A Framework for Batch Parallel Speculative Decoding
by: Tang, Zhenwei, et al.
Published: (2026)

Multi-Draft Speculative Sampling: Canonical Decomposition and Theoretical Limits
by: Khisti, Ashish, et al.
Published: (2024)

Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
by: Bachmann, Gregor, et al.
Published: (2025)

Speculative Streaming: Fast LLM Inference without Auxiliary Models
by: Bhendawade, Nikhil, et al.
Published: (2024)

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding
by: Sun, Shuoyang, et al.
Published: (2026)

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
by: Sun, Hanshi, et al.
Published: (2024)

TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
by: Qiu, Jiahao, et al.
Published: (2024)

Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
by: Zhao, Weilin, et al.
Published: (2024)

FlashDecoding++: Faster Large Language Model Inference on GPUs
by: Hong, Ke, et al.
Published: (2023)

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
by: Timor, Nadav, et al.
Published: (2025)

FastEagle: Cascaded Drafting for Accelerating Speculative Decoding
by: Huang, Haiduo, et al.
Published: (2025)

Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding
by: Xiao, Bin, et al.
Published: (2024)

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
by: Butler, Branden, et al.
Published: (2024)

RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models
by: Huang, Jie, et al.
Published: (2023)

Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
by: Pouransari, Hadi, et al.
Published: (2024)

AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability
by: Agrawal, Sudhanshu, et al.
Published: (2024)

QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
by: Tseng, Albert, et al.
Published: (2024)

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
by: Sun, Hanshi, et al.
Published: (2024)

A Theoretical Perspective for Speculative Decoding Algorithm
by: Yin, Ming, et al.
Published: (2024)

Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference
by: Zhou, Xuwen, et al.
Published: (2026)