Staff View: Geographic Adaptation of Pretrained Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Hofmann, Valentin; Glavaš, Goran; Ljubešić, Nikola; Pierrehumbert, Janet B.; Schütze, Hinrich
Format:	Preprint
Published:	2022
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2203.08565

_version_	1866917576268840960
author	Hofmann, Valentin; Glavaš, Goran; Ljubešić, Nikola; Pierrehumbert, Janet B.; Schütze, Hinrich
author_facet	Hofmann, Valentin; Glavaš, Goran; Ljubešić, Nikola; Pierrehumbert, Janet B.; Schütze, Hinrich
contents	While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: the geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2203_08565
institution	arXiv
publishDate	2022
record_format	arxiv
title	Geographic Adaptation of Pretrained Language Models
topic	Computation and Language
url	https://arxiv.org/abs/2203.08565
