author Mekki, Abdellah El
Magdy, Samar M.
Atou, Houdaifa
AbuHweidi, Ruwa
Qawasmeh, Baraah
Nacar, Omer
Al-hibiri, Thikra
Saadie, Razan
Alsayadi, Hamzah
Hammouda, Nadia Ghezaiel
Alkhazimi, Alshima
Hamod, Aya
Al-Ghafri, Al-Yas
El-Sayed, Wesam
sharji, Asila Al
Ballout, Mohamad
Belfathi, Anas
Ghaddar, Karim
Sibaee, Serry
Aoun, Alaa
Asiri, Areej
Abureesh, Lina
Bashiti, Ahlam
Yousef, Majdal
Hafiz, Abdulaziz
Mohamed, Yehdih
Hamedtou, Emira
Brahim, Brakehe
Alhamouri, Rahaf
Nafea, Youssef
Aatar, Aya El
Al-Dhabyani, Walid
Hamed, Emhemed
Shatnawi, Sara
Alwajih, Fakhraddin
Elkhidir, Khalid
Alasmari, Ashwag
Gerrio, Abdurrahman
Alshahri, Omar
Elmadany, AbdelRahim A.
Berrada, Ismail
Alkathiri, Amir Azad Adli
Zaraket, Fadi A
Jarrar, Mustafa
Hadj, Yahya Mohamed El
Alhuzali, Hassan
Abdul-Mageed, Muhammad
contents Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic (MSA). Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of parallel English-Dialectal Arabic multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total turns, Alexandria serves both as a training resource and as a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluations benchmark the current capabilities of Arabic-aware LLMs in translating across diverse Arabic dialects and sub-dialects while exposing significant persistent challenges. The Alexandria dataset, the creation prompts, the translation and revision guidelines, and the evaluation code are publicly available in the following repository: https://github.com/UBC-NLP/Alexandria
format Preprint
id arxiv_https___arxiv_org_abs_2601_13099
institution arXiv
publishDate 2026
record_format arxiv
title Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs
topic Computation and Language
url https://arxiv.org/abs/2601.13099