Bibliographic Details
Main Authors: Wang, Mingze, E, Weinan
Format: Preprint
Published: 2024
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2402.00522
_version_ 1866909371047346176
author Wang, Mingze
E, Weinan
author_facet Wang, Mingze
E, Weinan
contents We conduct a systematic study of the approximation properties of the Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of the Transformer, such as dot-product self-attention, positional encoding and the feed-forward layer, affect its expressive power, and we study their combined effects by establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads. These theoretical insights are validated experimentally and offer natural suggestions for alternative architectures.
format Preprint
id arxiv_https___arxiv_org_abs_2402_00522
institution arXiv
publishDate 2024
record_format arxiv
title Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling
topic Machine Learning
url https://arxiv.org/abs/2402.00522