Bibliographic Details
Main Authors: Ma, Xuezhe, Yang, Xiaomeng, Xiong, Wenhan, Chen, Beidi, Yu, Lili, Zhang, Hao, May, Jonathan, Zettlemoyer, Luke, Levy, Omer, Zhou, Chunting
Format: Preprint
Published: 2024
Subjects: Machine Learning, Computation and Language
Online Access: https://arxiv.org/abs/2404.08801
_version_ 1866910411760074752
author Ma, Xuezhe
Yang, Xiaomeng
Xiong, Wenhan
Chen, Beidi
Yu, Lili
Zhang, Hao
May, Jonathan
Zettlemoyer, Luke
Levy, Omer
Zhou, Chunting
author_facet Ma, Xuezhe
Yang, Xiaomeng
Xiong, Wenhan
Chen, Beidi
Yu, Lili
Zhang, Hao
May, Jonathan
Zettlemoyer, Luke
Levy, Omer
Zhou, Chunting
contents The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), a timestep normalization layer, a normalized attention mechanism, and pre-norm with a two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than the Transformer at the scale of 7 billion parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, landing midway between Llama2-7B (1.75) and Llama2-13B (1.67). Code: https://github.com/XuezheMax/megalodon
format Preprint
id arxiv_https___arxiv_org_abs_2404_08801
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Ma, Xuezhe
Yang, Xiaomeng
Xiong, Wenhan
Chen, Beidi
Yu, Lili
Zhang, Hao
May, Jonathan
Zettlemoyer, Luke
Levy, Omer
Zhou, Chunting
Machine Learning
Computation and Language
The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), a timestep normalization layer, a normalized attention mechanism, and pre-norm with a two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than the Transformer at the scale of 7 billion parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, landing midway between Llama2-7B (1.75) and Llama2-13B (1.67). Code: https://github.com/XuezheMax/megalodon
title Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2404.08801