MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Ma, Xuezhe; Yang, Xiaomeng; Xiong, Wenhan; Chen, Beidi; Yu, Lili; Zhang, Hao; May, Jonathan; Zettlemoyer, Luke; Levy, Omer; Zhou, Chunting
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2404.08801

_version_	1866910411760074752
author	Ma, Xuezhe; Yang, Xiaomeng; Xiong, Wenhan; Chen, Beidi; Yu, Lili; Zhang, Hao; May, Jonathan; Zettlemoyer, Luke; Levy, Omer; Zhou, Chunting
author_facet	Ma, Xuezhe; Yang, Xiaomeng; Xiong, Wenhan; Chen, Beidi; Yu, Lili; Zhang, Hao; May, Jonathan; Zettlemoyer, Luke; Levy, Omer; Zhou, Chunting
contents	The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), a timestep normalization layer, a normalized attention mechanism, and pre-norm with a two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than the Transformer at the scale of 7 billion parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, landing midway between Llama2-7B (1.75) and Llama2-13B (1.67). Code: https://github.com/XuezheMax/megalodon
format	Preprint
id	arxiv_https___arxiv_org_abs_2404_08801
institution	arXiv
publishDate	2024
record_format	arxiv
title	Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2404.08801

Similar Items