Staff View: GottBERT: a pure German Language Model :: Library Catalog

Bibliographic Details
Main Authors:	Scheible, Raphael; Frei, Johann; Thomczyk, Fabian; He, Henry; Tippmann, Patric; Knaus, Jochen; Jaravine, Victor; Kramer, Frank; Boeker, Martin
Format:	Preprint
Published:	2020
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2012.02110

_version_	1866916072580448256
author	Scheible, Raphael; Frei, Johann; Thomczyk, Fabian; He, Henry; Tippmann, Patric; Knaus, Jochen; Jaravine, Victor; Kramer, Frank; Boeker, Martin
author_facet	Scheible, Raphael; Frei, Johann; Thomczyk, Fabian; He, Henry; Tippmann, Patric; Knaus, Jochen; Jaravine, Victor; Kramer, Frank; Boeker, Martin
contents	Pre-trained language models have significantly advanced natural language processing (NLP), especially with the introduction of BERT and its optimized version, RoBERTa. While initial research focused on English, single-language models can be advantageous compared to multilingual ones in terms of pre-training effort, overall resource efficiency or downstream task performance. Despite the growing popularity of prompt-based LLMs, more compute-efficient BERT-like models remain highly relevant. In this work, we present the first German single-language RoBERTa model, GottBERT, pre-trained exclusively on the German portion of the OSCAR dataset. Additionally, we investigated the impact of filtering the OSCAR corpus. GottBERT was pre-trained using fairseq and standard hyperparameters. We evaluated its performance on two Named Entity Recognition (NER) tasks (CoNLL 2003 and GermEval 2014) and three text classification tasks (GermEval 2018 fine and coarse, and 10kGNAD) against existing German BERT models and two multilingual models. Performance was measured using the $F_{1}$ score and accuracy. The GottBERT base and large models showed competitive performance, with GottBERT leading among the base models in 4 of 6 tasks. Contrary to our expectation, the applied filtering did not significantly affect the results. To support the German NLP research community, we are releasing the GottBERT models under the MIT license.
format	Preprint
id	arxiv_https___arxiv_org_abs_2012_02110
institution	arXiv
publishDate	2020
record_format	arxiv
title	GottBERT: a pure German Language Model
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2012.02110

Similar Items