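Staff View: Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs :: Library Catalog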

Bibliographic Details
Main Authors:	Mekki, Abdellah El; Magdy, Samar M.; Atou, Houdaifa; AbuHweidi, Ruwa; Qawasmeh, Baraah; Nacar, Omer; Al-hibiri, Thikra; Saadie, Razan; Alsayadi, Hamzah; Hammouda, Nadia Ghezaiel; Alkhazimi, Alshima; Hamod, Aya; Al-Ghafri, Al-Yas; El-Sayed, Wesam; sharji, Asila Al; Ballout, Mohamad; Belfathi, Anas; Ghaddar, Karim; Sibaee, Serry; Aoun, Alaa; Asiri, Areej; Abureesh, Lina; Bashiti, Ahlam; Yousef, Majdal; Hafiz, Abdulaziz; Mohamed, Yehdih; Hamedtou, Emira; Brahim, Brakehe; Alhamouri, Rahaf; Nafea, Youssef; Aatar, Aya El; Al-Dhabyani, Walid; Hamed, Emhemed; Shatnawi, Sara; Alwajih, Fakhraddin; Elkhidir, Khalid; Alasmari, Ashwag; Gerrio, Abdurrahman; Alshahri, Omar; Elmadany, AbdelRahim A.; Berrada, Ismail; Alkathiri, Amir Azad Adli; Zaraket, Fadi A; Jarrar, Mustafa; Hadj, Yahya Mohamed El; Alhuzali, Hassan; Abdul-Mageed, Muhammad
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.13099

_version_	1866915944188608512
author	Mekki, Abdellah El; Magdy, Samar M.; Atou, Houdaifa; AbuHweidi, Ruwa; Qawasmeh, Baraah; Nacar, Omer; Al-hibiri, Thikra; Saadie, Razan; Alsayadi, Hamzah; Hammouda, Nadia Ghezaiel; Alkhazimi, Alshima; Hamod, Aya; Al-Ghafri, Al-Yas; El-Sayed, Wesam; sharji, Asila Al; Ballout, Mohamad; Belfathi, Anas; Ghaddar, Karim; Sibaee, Serry; Aoun, Alaa; Asiri, Areej; Abureesh, Lina; Bashiti, Ahlam; Yousef, Majdal; Hafiz, Abdulaziz; Mohamed, Yehdih; Hamedtou, Emira; Brahim, Brakehe; Alhamouri, Rahaf; Nafea, Youssef; Aatar, Aya El; Al-Dhabyani, Walid; Hamed, Emhemed; Shatnawi, Sara; Alwajih, Fakhraddin; Elkhidir, Khalid; Alasmari, Ashwag; Gerrio, Abdurrahman; Alshahri, Omar; Elmadany, AbdelRahim A.; Berrada, Ismail; Alkathiri, Amir Azad Adli; Zaraket, Fadi A; Jarrar, Mustafa; Hadj, Yahya Mohamed El; Alhuzali, Hassan; Abdul-Mageed, Muhammad
author_facet	Mekki, Abdellah El; Magdy, Samar M.; Atou, Houdaifa; AbuHweidi, Ruwa; Qawasmeh, Baraah; Nacar, Omer; Al-hibiri, Thikra; Saadie, Razan; Alsayadi, Hamzah; Hammouda, Nadia Ghezaiel; Alkhazimi, Alshima; Hamod, Aya; Al-Ghafri, Al-Yas; El-Sayed, Wesam; sharji, Asila Al; Ballout, Mohamad; Belfathi, Anas; Ghaddar, Karim; Sibaee, Serry; Aoun, Alaa; Asiri, Areej; Abureesh, Lina; Bashiti, Ahlam; Yousef, Majdal; Hafiz, Abdulaziz; Mohamed, Yehdih; Hamedtou, Emira; Brahim, Brakehe; Alhamouri, Rahaf; Nafea, Youssef; Aatar, Aya El; Al-Dhabyani, Walid; Hamed, Emhemed; Shatnawi, Sara; Alwajih, Fakhraddin; Elkhidir, Khalid; Alasmari, Ashwag; Gerrio, Abdurrahman; Alshahri, Omar; Elmadany, AbdelRahim A.; Berrada, Ismail; Alkathiri, Amir Azad Adli; Zaraket, Fadi A; Jarrar, Mustafa; Hadj, Yahya Mohamed El; Alhuzali, Hassan; Abdul-Mageed, Muhammad
contents	Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic (MSA). Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of parallel English-Dialectal Arabic multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total turns, Alexandria serves as both a training resource and as a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation benchmarks the current capabilities of Arabic-aware LLMs in translating across diverse Arabic dialects and sub-dialects while exposing significant persistent challenges. The Alexandria dataset, the creation prompts, the translation and revision guidelines, and the evaluation code are publicly available in the following repository: https://github.com/UBC-NLP/Alexandria
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_13099
institution	arXiv
publishDate	2026
record_format	arxiv
title	Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs
topic	Computation and Language
url	https://arxiv.org/abs/2601.13099

Similar Items