Bibliographic Details
Main Authors: Marivate, Vukosi; Dzingirai, Isheanesu; Banda, Fiskani; Lastrucci, Richard; Sindane, Thapelo; Madumo, Keabetswe; Olaleye, Kayode; Modupe, Abiodun; Netshifhefhe, Unarine; Combrink, Herkulaas; Nakeng, Mohlatlego; Ledwaba, Matome
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2508.03529
contents The critical lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Mafoko addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Mafoko dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. Mafoko provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa's rich linguistic diversity is represented in the digital age.
institution arXiv
title Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP
topic Computation and Language