Saved in:
Bibliographic Details
Main Author: Pavlova, Vera
Format: Preprint
Published: 2025
Subjects:
Online Access:https://arxiv.org/abs/2501.10175
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866929680186081280
author Pavlova, Vera
author_facet Pavlova, Vera
contents This study examines the use of Natural Language Processing (NLP) technology within the Islamic domain, focusing on developing an Islamic neural retrieval model. By leveraging the robust XLM-R model, the research employs a language reduction technique to create a lightweight bilingual large language model (LLM). Our approach for domain adaptation addresses the unique challenges faced in the Islamic domain, where substantial in-domain corpora exist only in Arabic while limited in other languages, including English. The work utilizes a multi-stage training process for retrieval models, incorporating large retrieval datasets, such as MS MARCO, and smaller, in-domain datasets to improve retrieval performance. Additionally, we have curated an in-domain retrieval dataset in English by employing data augmentation techniques and involving a reliable Islamic source. This approach enhances the domain-specific dataset for retrieval, leading to further performance gains. The findings suggest that combining domain adaptation and a multi-stage training method for the bilingual Islamic neural retrieval model enables it to outperform monolingual models on downstream retrieval tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2501_10175
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval
Pavlova, Vera
Computation and Language
This study examines the use of Natural Language Processing (NLP) technology within the Islamic domain, focusing on developing an Islamic neural retrieval model. By leveraging the robust XLM-R model, the research employs a language reduction technique to create a lightweight bilingual large language model (LLM). Our approach for domain adaptation addresses the unique challenges faced in the Islamic domain, where substantial in-domain corpora exist only in Arabic while limited in other languages, including English. The work utilizes a multi-stage training process for retrieval models, incorporating large retrieval datasets, such as MS MARCO, and smaller, in-domain datasets to improve retrieval performance. Additionally, we have curated an in-domain retrieval dataset in English by employing data augmentation techniques and involving a reliable Islamic source. This approach enhances the domain-specific dataset for retrieval, leading to further performance gains. The findings suggest that combining domain adaptation and a multi-stage training method for the bilingual Islamic neural retrieval model enables it to outperform monolingual models on downstream retrieval tasks.
title Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval
topic Computation and Language
url https://arxiv.org/abs/2501.10175