Bibliographic Details
Main Authors: Alnumay, Yazeed, Barbet, Alexandre, Bialas, Anna, Darling, William, Desai, Shaan, Devassy, Joan, Duffy, Kyle, Howe, Stephanie, Lasche, Olivia, Lee, Justin, Shrinivason, Anirudh, Tracey, Jennifer
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2503.14603
author Alnumay, Yazeed
Barbet, Alexandre
Bialas, Anna
Darling, William
Desai, Shaan
Devassy, Joan
Duffy, Kyle
Howe, Stephanie
Lasche, Olivia
Lee, Justin
Shrinivason, Anirudh
Tracey, Jennifer
contents Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging due to the limited availability of digitized Arabic data. In this work, we present a data synthesis and refinement strategy to help address this problem: we leverage synthetic data generation and human-in-the-loop annotation to expand our Arabic training corpus. We further present our iterative post-training recipe, which is essential to achieving state-of-the-art performance in aligning the model with human preferences, a critical aspect of enterprise use cases. The culmination of this effort is the release of a small, 7B, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.
format Preprint
id arxiv_https___arxiv_org_abs_2503_14603
institution arXiv
publishDate 2025
record_format arxiv
title Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2503.14603