author Tang, Bangsheng
Fu, Carl Chengyan
Kou, Fei
Sizov, Grigory
Zhang, Haoci
Park, Jason
Liu, Jiawen
You, Jie
Yang, Qirui
Mehta, Sachin
Cai, Shengyong
Wang, Xiaodong
Liu, Xingyu
Li, Yunlu
Zhou, Yanjun
Wei, Wei
Zhao, Zhiwei
Qi, Zixi
Victoria, Adolfo
Ibrahim, Aya
Wasti, Bram
Kim, Changkyu
Haziza, Daniel
Sun, Fei
Delfin, Giancarlo
Guo, Emily
Ouyang, Jialin
Lee, Jaewon
Huang, Jianyu
Reizenstein, Jeremy
Fang, Lu
Zhu, Quinn
Verma, Ria
Mihailescu, Vlad
Guo, Xingwen
Cui, Yan
Hu, Ye
Lee, Yejin
contents Speculative decoding is a standard method for accelerating inference in large language models. However, scaling it to production environments poses several engineering challenges, including efficiently implementing operations such as tree attention and multi-round speculative decoding on GPUs. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the best previously known method. Furthermore, for EAGLE-based speculative decoding, our optimizations yield speed-ups between 1.4x and 2.0x at large batch sizes at production scale.
format Preprint
id arxiv_https___arxiv_org_abs_2508_08192
institution arXiv
publishDate 2025
record_format arxiv
title Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
topic Computation and Language
url https://arxiv.org/abs/2508.08192