| Main Authors: | Zhou, Enyu, Sheng, Kai, Chen, Hao, He, Xin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2508.04462 |
Similar Items
SpecPipe: Accelerating Pipeline Parallelism-based LLM Inference with Speculative Decoding
by: Yin, Haofei, et al.
Published: (2025)
Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement
by: Jeon, Wonseok, et al.
Published: (2024)
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
by: Chen, Hao Mark, et al.
Published: (2024)
Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices
by: Mesa, Alejandro Ruiz y, et al.
Published: (2026)
Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
by: Wen, Zhuofan, et al.
Published: (2024)
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
by: Timor, Nadav, et al.
Published: (2025)
RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse
by: Geng, Yingsheng, et al.
Published: (2026)
SPIRe: Boosting LLM Inference Throughput with Speculative Decoding
by: Neelam, Sanjit, et al.
Published: (2025)
SNLP: Layer-Parallel Inference via Structured Newton Corrections
by: Han, Ligong, et al.
Published: (2026)
Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
by: Koh, Jungyeon, et al.
Published: (2025)
Fast Inference via Hierarchical Speculative Decoding
by: Mohri, Clara, et al.
Published: (2025)
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
by: Song, Dinghong, et al.
Published: (2025)
MineDraft: A Framework for Batch Parallel Speculative Decoding
by: Tang, Zhenwei, et al.
Published: (2026)
Accelerating Transformer Inference for Translation via Parallel Decoding
by: Santilli, Andrea, et al.
Published: (2023)
Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference
by: Zhou, Xuwen, et al.
Published: (2026)
Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
by: Agrawal, Sudhanshu, et al.
Published: (2025)
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
by: Xiao, Zilin, et al.
Published: (2024)
Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding
by: Sun, Shuoyang, et al.
Published: (2026)
SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache
by: Chang, Chi-Chih, et al.
Published: (2026)
Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding
by: Park, Jihoon, et al.
Published: (2025)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
by: Cai, Tianle, et al.
Published: (2024)
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
by: Zhao, Yilong, et al.
Published: (2025)
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
by: Dong, Yanhao, et al.
Published: (2025)
Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
by: Li, Ruixiao, et al.
Published: (2025)
SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding
by: Bang, Jehyeon, et al.
Published: (2026)
Speculative Speculative Decoding
by: Kumar, Tanishq, et al.
Published: (2026)
AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration
by: McDanel, Bradley
Published: (2024)
LiteCache: A Query Similarity-Driven, GPU-Centric KVCache Subsystem for Efficient LLM Inference
by: Yi, Jiawei, et al.
Published: (2025)
TPP-SD: Accelerating Transformer Point Process Sampling with Speculative Decoding
by: Gong, Shukai, et al.
Published: (2025)
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
by: Hao, Yongchang, et al.
Published: (2026)
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
by: Tiwari, Rishabh, et al.
Published: (2025)
FastEagle: Cascaded Drafting for Accelerating Speculative Decoding
by: Huang, Haiduo, et al.
Published: (2025)
Accelerating Time Series Foundation Models with Speculative Decoding
by: Subbaraman, Pranav, et al.
Published: (2025)
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding
by: Yi, Hanling, et al.
Published: (2024)
ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
by: Zhang, Qiuyang, et al.
Published: (2026)
CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
by: Ning, Zhiyuan, et al.
Published: (2025)
Lever: Speculative LLM Inference on Smartphones
by: Wang, Tuowei, et al.
Published: (2026)
CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
by: Han, Yuning, et al.
Published: (2026)
Block Verification Accelerates Speculative Decoding
by: Sun, Ziteng, et al.
Published: (2024)
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
by: Iso, Hayate, et al.
Published: (2026)