Staff View: IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages :: Library Catalog

Bibliographic Details
Main Authors:	Khan, Mohammed Safi Ur Rahman; Mehta, Priyam; Sankar, Ananth; Kumaravelan, Umashankar; Doddapaneni, Sumanth; B, Suriyaprasaad; G, Varun Balan; Jain, Sparsh; Kunchukuttan, Anoop; Kumar, Pratyush; Dabre, Raj; Khapra, Mitesh M.
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2403.06350

_version_	1866929608101724160
author	Khan, Mohammed Safi Ur Rahman; Mehta, Priyam; Sankar, Ananth; Kumaravelan, Umashankar; Doddapaneni, Sumanth; B, Suriyaprasaad; G, Varun Balan; Jain, Sparsh; Kunchukuttan, Anoop; Kumar, Pratyush; Dabre, Raj; Khapra, Mitesh M.
author_facet	Khan, Mohammed Safi Ur Rahman; Mehta, Priyam; Sankar, Ananth; Kumaravelan, Umashankar; Doddapaneni, Sumanth; B, Suriyaprasaad; G, Varun Balan; Jain, Sparsh; Kunchukuttan, Anoop; Kumar, Pratyush; Dabre, Raj; Khapra, Mitesh M.
contents	Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered by the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generating non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released with permissive licenses.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_06350
institution	arXiv
publishDate	2024
record_format	arxiv
title	IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages
topic	Computation and Language
url	https://arxiv.org/abs/2403.06350
