Staff View: Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM :: Library Catalog

Bibliographic Details
Main Authors:	Alnumay, Yazeed; Barbet, Alexandre; Bialas, Anna; Darling, William; Desai, Shaan; Devassy, Joan; Duffy, Kyle; Howe, Stephanie; Lasche, Olivia; Lee, Justin; Shrinivason, Anirudh; Tracey, Jennifer
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2503.14603

_version_	1866916656582754304
author	Alnumay, Yazeed; Barbet, Alexandre; Bialas, Anna; Darling, William; Desai, Shaan; Devassy, Joan; Duffy, Kyle; Howe, Stephanie; Lasche, Olivia; Lee, Justin; Shrinivason, Anirudh; Tracey, Jennifer
author_facet	Alnumay, Yazeed; Barbet, Alexandre; Bialas, Anna; Darling, William; Desai, Shaan; Devassy, Joan; Duffy, Kyle; Howe, Stephanie; Lasche, Olivia; Lee, Justin; Shrinivason, Anirudh; Tracey, Jennifer
contents	Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging due to the limited availability of digitized Arabic data. In this work, we present a data synthesis and refinement strategy to help address this problem, namely, leveraging synthetic data generation and human-in-the-loop annotation to expand our Arabic training corpus. We further present our iterative post-training recipe, which is essential to achieving state-of-the-art performance in aligning the model with human preferences, a critical aspect of enterprise use cases. The culmination of this effort is the release of a small, 7B, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_14603
institution	arXiv
publishDate	2025
record_format	arxiv
title	Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2503.14603