Staff View: Bootstrapping Embeddings for Low Resource Languages :: Library Catalog

Bibliographic Details
Main Authors:	Basoz, Merve; Horne, Andrew; Opper, Mattia
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.01732

_version_	1866910057025765376
author	Basoz, Merve; Horne, Andrew; Opper, Mattia
author_facet	Basoz, Merve; Horne, Andrew; Opper, Mattia
contents	Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_01732
institution	arXiv
publishDate	2026
record_format	arxiv
title	Bootstrapping Embeddings for Low Resource Languages
topic	Computation and Language
url	https://arxiv.org/abs/2603.01732

Similar Items