MARC View: :: Library Catalog

Bibliographic Details
Main Authors:	Tang, Bangsheng; Fu, Carl Chengyan; Kou, Fei; Sizov, Grigory; Zhang, Haoci; Park, Jason; Liu, Jiawen; You, Jie; Yang, Qirui; Mehta, Sachin; Cai, Shengyong; Wang, Xiaodong; Liu, Xingyu; Li, Yunlu; Zhou, Yanjun; Wei, Wei; Zhao, Zhiwei; Qi, Zixi; Victoria, Adolfo; Ibrahim, Aya; Wasti, Bram; Kim, Changkyu; Haziza, Daniel; Sun, Fei; Delfin, Giancarlo; Guo, Emily; Ouyang, Jialin; Lee, Jaewon; Huang, Jianyu; Reizenstein, Jeremy; Fang, Lu; Zhu, Quinn; Verma, Ria; Mihailescu, Vlad; Guo, Xingwen; Cui, Yan; Hu, Ye; Lee, Yejin
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2508.08192

_version_	1866915440213622784
author	Tang, Bangsheng; Fu, Carl Chengyan; Kou, Fei; Sizov, Grigory; Zhang, Haoci; Park, Jason; Liu, Jiawen; You, Jie; Yang, Qirui; Mehta, Sachin; Cai, Shengyong; Wang, Xiaodong; Liu, Xingyu; Li, Yunlu; Zhou, Yanjun; Wei, Wei; Zhao, Zhiwei; Qi, Zixi; Victoria, Adolfo; Ibrahim, Aya; Wasti, Bram; Kim, Changkyu; Haziza, Daniel; Sun, Fei; Delfin, Giancarlo; Guo, Emily; Ouyang, Jialin; Lee, Jaewon; Huang, Jianyu; Reizenstein, Jeremy; Fang, Lu; Zhu, Quinn; Verma, Ria; Mihailescu, Vlad; Guo, Xingwen; Cui, Yan; Hu, Ye; Lee, Yejin
author_facet	Tang, Bangsheng; Fu, Carl Chengyan; Kou, Fei; Sizov, Grigory; Zhang, Haoci; Park, Jason; Liu, Jiawen; You, Jie; Yang, Qirui; Mehta, Sachin; Cai, Shengyong; Wang, Xiaodong; Liu, Xingyu; Li, Yunlu; Zhou, Yanjun; Wei, Wei; Zhao, Zhiwei; Qi, Zixi; Victoria, Adolfo; Ibrahim, Aya; Wasti, Bram; Kim, Changkyu; Haziza, Daniel; Sun, Fei; Delfin, Giancarlo; Guo, Emily; Ouyang, Jialin; Lee, Jaewon; Huang, Jianyu; Reizenstein, Jeremy; Fang, Lu; Zhu, Quinn; Verma, Ria; Mihailescu, Vlad; Guo, Xingwen; Cui, Yan; Hu, Ye; Lee, Yejin
contents	Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes between 1.4x and 2.0x at production scale.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_08192
institution	arXiv
publishDate	2025
record_format	arxiv
title	Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
topic	Computation and Language
url	https://arxiv.org/abs/2508.08192
