| Main Authors: | Downey, C. M.; Blevins, Terra; Serai, Dhwani; Parikh, Dwija; Steinert-Threlkeld, Shane |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computation and Language |
| Online Access: | https://arxiv.org/abs/2405.12413 |
| _version_ | 1866913357782581248 |
|---|---|
| author | Downey, C. M.; Blevins, Terra; Serai, Dhwani; Parikh, Dwija; Steinert-Threlkeld, Shane |
| author_facet | Downey, C. M.; Blevins, Terra; Serai, Dhwani; Parikh, Dwija; Steinert-Threlkeld, Shane |
| contents | The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best practices for adapting a pre-trained model to a language family. Focusing on the Uralic family as a test case, we adapt XLM-R under various configurations to model 15 languages; we then evaluate the performance of each experimental setting on two downstream tasks and 11 evaluation languages. Our adapted models significantly outperform mono- and multilingual baselines. Furthermore, a regression analysis of hyperparameter effects reveals that adapted vocabulary size is relatively unimportant for low-resource languages, and that low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages. These results introduce new best practices for performing language adaptation in a targeted setting. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2405_12413 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Targeted Multilingual Adaptation for Low-resource Language Families |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2405.12413 |