Bibliographic Details
Main Authors: Hofmann, Valentin; Glavaš, Goran; Ljubešić, Nikola; Pierrehumbert, Janet B.; Schütze, Hinrich
Format: Preprint
Published: 2022
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2203.08565
contents While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: the geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.
format Preprint
id arxiv_https___arxiv_org_abs_2203_08565
institution arXiv
publishDate 2022
record_format arxiv
title Geographic Adaptation of Pretrained Language Models
topic Computation and Language
url https://arxiv.org/abs/2203.08565