Bibliographic Details
Main Authors: Rodrigues, João; Gomes, Luís; Silva, João; Branco, António; Santos, Rodrigo; Cardoso, Henrique Lopes; Osório, Tomás
Format: Preprint
Published: 2023
Online Access: https://arxiv.org/abs/2305.06721
Abstract:
  • To advance the neural encoding of Portuguese (PT), and a fortiori the technological preparation of this language for the digital age, we developed a Transformer-based foundation model that sets a new state of the art for two of its variants: European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR). To develop this encoder, which we named Albertina PT-*, we started from a strong model, DeBERTa, and pre-trained it on Portuguese data sets: data sets we gathered for PT-PT and PT-BR, and the brWaC corpus for PT-BR. We assessed the performance of Albertina and competing models on prominent downstream language processing tasks adapted for Portuguese. Both the Albertina PT-PT and PT-BR versions are distributed free of charge under the most permissive license possible and can be run on consumer-grade hardware, thus seeking to contribute to the advancement of research and innovation in language technology for Portuguese.