Bibliographic Details
Main Authors: Scheible, Raphael, Frei, Johann, Thomczyk, Fabian, He, Henry, Tippmann, Patric, Knaus, Jochen, Jaravine, Victor, Kramer, Frank, Boeker, Martin
Format: Preprint
Published: 2020
Subjects: Computation and Language; Machine Learning
Online Access: https://arxiv.org/abs/2012.02110
author Scheible, Raphael
Frei, Johann
Thomczyk, Fabian
He, Henry
Tippmann, Patric
Knaus, Jochen
Jaravine, Victor
Kramer, Frank
Boeker, Martin
contents Pre-trained language models have significantly advanced natural language processing (NLP), especially with the introduction of BERT and its optimized version, RoBERTa. While initial research focused on English, single-language models can be advantageous compared to multilingual ones in terms of pre-training effort, overall resource efficiency, or downstream task performance. Despite the growing popularity of prompt-based LLMs, more compute-efficient BERT-like models remain highly relevant. In this work, we present the first German single-language RoBERTa model, GottBERT, pre-trained exclusively on the German portion of the OSCAR dataset. Additionally, we investigated the impact of filtering the OSCAR corpus. GottBERT was pre-trained using fairseq and standard hyperparameters. We evaluated its performance on two Named Entity Recognition (NER) tasks (CoNLL 2003 and GermEval 2014) and three text classification tasks (GermEval 2018 fine and coarse, and 10kGNAD) against existing German BERT models and two multilingual models. Performance was measured using the $F_{1}$ score and accuracy. The GottBERT base and large models showed competitive performance, with GottBERT leading among the base models in 4 of 6 tasks. Contrary to our expectation, the applied filtering did not significantly affect the results. To support the German NLP research community, we are releasing the GottBERT models under the MIT license.
format Preprint
id arxiv_https___arxiv_org_abs_2012_02110
institution arXiv
publishDate 2020
record_format arxiv
title GottBERT: a pure German Language Model
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2012.02110
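Note on the metrics named in the abstract: the record states that performance was measured with the $F_{1}$ score and accuracy, but it does not say how $F_{1}$ was averaged. The following is a minimal sketch, assuming macro-averaged $F_{1}$ over class labels for a text classification task. The function names and the labels OFFENSE and OTHER are illustrative and not taken from the paper. NER benchmarks are usually scored at the entity level, which this sketch does not cover.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_per_label(gold, pred):
    """Per-label F1 = 2*TP / (2*TP + FP + FN); 0.0 if the denominator is 0."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-label F1 scores (an assumed averaging scheme)."""
    scores = f1_per_label(gold, pred)
    return sum(scores.values()) / len(scores)


# Illustrative labels only; not data from the paper.
gold = ["OFFENSE", "OTHER", "OTHER", "OFFENSE"]
pred = ["OFFENSE", "OTHER", "OFFENSE", "OFFENSE"]
print(accuracy(gold, pred))
print(macro_f1(gold, pred))
```

Macro averaging weights every class equally, so it differs from accuracy when classes are imbalanced; a micro or weighted average would be a different, equally plausible choice.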