Staff View :: Library Catalog


Bibliographic Details
Main authors:	Rom, Aviad; Bar, Kfir
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Machine Learning
Online access:	https://arxiv.org/abs/2402.16065

_version_	1866910343399211008
author	Rom, Aviad; Bar, Kfir
author_facet	Rom, Aviad; Bar, Kfir
contents	We train a bilingual Arabic-Hebrew language model using a transliterated version of Arabic texts in Hebrew, to ensure both languages are represented in the same script. Given the morphological, structural similarities, and the extensive number of cognates shared among Arabic and Hebrew, we assess the performance of a language model that employs a unified script for both languages, on machine translation which requires cross-lingual knowledge. The results are promising: our model outperforms a contrasting model which keeps the Arabic texts in the Arabic script, demonstrating the efficacy of the transliteration step. Despite being trained on a dataset approximately 60% smaller than that of other existing language models, our model appears to deliver comparable performance in machine translation across both translation directions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_16065
institution	arXiv
publishDate	2024
record_format	arxiv
title	Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2402.16065
