Bibliographic Details
Main Authors: Mehra, Somesh; Garcia, Javier Alonso; Mauch, Lukas
Format: Preprint
Published: 2025
Subjects: Computation and Language, Machine Learning
Online Access: https://arxiv.org/abs/2502.09419
author Mehra, Somesh
Garcia, Javier Alonso
Mauch, Lukas
contents We systematically investigate multi-token prediction (MTP) capabilities within LLMs pre-trained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, prompting further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction.
format Preprint
id arxiv_https___arxiv_org_abs_2502_09419
institution arXiv
publishDate 2025
record_format arxiv
title On multi-token prediction for efficient LLM inference
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2502.09419