Staff View: On multi-token prediction for efficient LLM inference

Bibliographic Details
Main Authors:	Mehra, Somesh; Garcia, Javier Alonso; Mauch, Lukas
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2502.09419

_version_	1866910825673916416
author	Mehra, Somesh; Garcia, Javier Alonso; Mauch, Lukas
author_facet	Mehra, Somesh; Garcia, Javier Alonso; Mauch, Lukas
contents	We systematically investigate multi-token prediction (MTP) capabilities within LLMs pre-trained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, prompting further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_09419
institution	arXiv
publishDate	2025
record_format	arxiv
title	On multi-token prediction for efficient LLM inference
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2502.09419
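
Note on the abstract: "numerical marginalization over intermediate token probabilities" can be read as computing P(x_{t+2} | x_{1..t}) by summing P(x_{t+1} | x_{1..t}) * P(x_{t+2} | x_{1..t}, x_{t+1}) over every candidate x_{t+1}. The Python sketch below illustrates only that identity. It is not the authors' implementation: the function names, the 3-token vocabulary, and the toy transition matrix T are all hypothetical, and an exact sum over the vocabulary is practical only at toy scale.

```python
import numpy as np

def two_step_ahead(next_token_probs, context, vocab_size):
    """Return P(x_{t+2} | context) by marginalizing over the intermediate token x_{t+1}.

    next_token_probs(context) is assumed to return a length-vocab_size
    probability vector for the token that follows `context`.
    """
    p_next = next_token_probs(context)          # P(x_{t+1} | context)
    p_two = np.zeros(vocab_size)
    for tok in range(vocab_size):
        # Weight the model's prediction after appending `tok`
        # by the probability of `tok` itself.
        p_two += p_next[tok] * next_token_probs(context + [tok])
    return p_two

# Hypothetical stand-in for an NTP model: a first-order Markov chain
# whose next-token distribution depends only on the last token.
T = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.3, 0.3, 0.4]])

def toy_ntp(context):
    return T[context[-1]]

p2 = two_step_ahead(toy_ntp, [0], vocab_size=3)
print(p2)
```

For this Markov toy model the marginalized result coincides with the row-vector product T[0] @ T, which gives a quick sanity check on the loop.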