Saved in:
| Main Authors: | , , , |
|---|---|
| Format: | Preprint |
| Published: |
2024
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.19637 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866908570119831552 |
|---|---|
| author | Rende, Riccardo Gerace, Federica Laio, Alessandro Goldt, Sebastian |
| author_facet | Rende, Riccardo Gerace, Federica Laio, Alessandro Goldt, Sebastian |
| contents | The remarkable capability of over-parameterised neural networks to generalise effectively has been explained by invoking a ``simplicity bias'': neural networks prevent overfitting by initially learning simple classifiers before progressing to more complex, non-linear functions. While simplicity biases have been described theoretically and experimentally in feed-forward networks for supervised learning, the extent to which they also explain the remarkable success of transformers trained with self-supervised techniques remains unclear. In our study, we demonstrate that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions while continuing to learn high-degree interactions. To conduct this analysis, we develop a procedure to generate \textit{clones} of a given natural language data set, which rigorously capture the interactions between tokens up to a specified order. This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2410_19637 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| spellingShingle | A distributional simplicity bias in the learning dynamics of transformers Rende, Riccardo Gerace, Federica Laio, Alessandro Goldt, Sebastian Computation and Language The remarkable capability of over-parameterised neural networks to generalise effectively has been explained by invoking a ``simplicity bias'': neural networks prevent overfitting by initially learning simple classifiers before progressing to more complex, non-linear functions. While simplicity biases have been described theoretically and experimentally in feed-forward networks for supervised learning, the extent to which they also explain the remarkable success of transformers trained with self-supervised techniques remains unclear. In our study, we demonstrate that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions while continuing to learn high-degree interactions. To conduct this analysis, we develop a procedure to generate \textit{clones} of a given natural language data set, which rigorously capture the interactions between tokens up to a specified order. This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond. |
| title | A distributional simplicity bias in the learning dynamics of transformers |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2410.19637 |