Bibliographic Details
Main Authors: Fernandes, Reuben Chagas; Patkar, Gaurang S.
Format: Preprint
Published: 2026
Subjects: Computation and Language; Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.23529
_version_ 1866912981289271296
author Fernandes, Reuben Chagas
Patkar, Gaurang S.
author_facet Fernandes, Reuben Chagas
Patkar, Gaurang S.
contents Large Language Models (LLMs) consistently underperform in low-resource linguistic contexts such as Konkani. This performance deficit stems from acute training data scarcity compounded by high script diversity across Devanagari, Romi and Kannada orthographies. To address this gap, we introduce Konkani-Instruct-100k, a comprehensive synthetic instruction-tuning dataset generated through Gemini 3. We establish rigorous baseline benchmarks by evaluating leading open-weights architectures including Llama 3.1, Qwen2.5 and Gemma 3 alongside proprietary closed-source models. Our primary contribution involves the development of Konkani LLM, a series of fine-tuned models optimized for regional nuances. Furthermore, we are developing the Multi-Script Konkani Benchmark to facilitate cross-script linguistic evaluation. In machine translation, Konkani LLM delivers consistent gains over the corresponding base models and is competitive with, and in several settings surpasses, proprietary baselines.
format Preprint
id arxiv_https___arxiv_org_abs_2603_23529
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language
Fernandes, Reuben Chagas
Patkar, Gaurang S.
Computation and Language
Artificial Intelligence
title Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2603.23529