Bibliographic Details
Main Authors: Rowe, Jacqueline, Gow-Smith, Edward, Hepple, Mark
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2504.02674
author Rowe, Jacqueline
Gow-Smith, Edward
Hepple, Mark
contents We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand parallel sentences with English and Portuguese. The dataset consists predominantly of religious data (from the Bible and texts from the Jehovah's Witnesses), along with a small amount of general-domain data (from a dictionary). This mirrors the typical resource availability of many low-resource languages. We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain. We find that adding even 300 sentences from the target domain during training substantially improves translation performance, highlighting the importance of, and need for, data collection for low-resource languages, even on a small scale. We additionally find that Portuguese-to-Kiriol translation models perform better on average than other source and target language pairs, and we investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and their lexifiers. Overall, we hope our work will stimulate research into Kiriol and into how machine translation might better support creole languages in general.
format Preprint
id arxiv_https___arxiv_org_abs_2504_02674
institution arXiv
publishDate 2025
record_format arxiv
title Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole
topic Computation and Language
url https://arxiv.org/abs/2504.02674