Staff View: Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Miao, Zhongtao; Wu, Qiyu; Zhao, Kaiyan; Wu, Zilong; Tsuruoka, Yoshimasa
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2404.02490

_version_	1866929301994078208
author	Miao, Zhongtao; Wu, Qiyu; Zhao, Kaiyan; Wu, Zilong; Tsuruoka, Yoshimasa
author_facet	Miao, Zhongtao; Wu, Qiyu; Zhao, Kaiyan; Wu, Zilong; Tsuruoka, Yoshimasa
contents	The field of cross-lingual sentence embeddings has recently experienced significant advancements, but research concerning low-resource languages has lagged due to the scarcity of parallel corpora. This paper shows that cross-lingual word representation in low-resource languages is notably under-aligned with that in high-resource languages in current models. To address this, we introduce a novel framework that explicitly aligns words between English and eight low-resource languages, utilizing off-the-shelf word alignment models. This framework incorporates three primary training objectives: aligned word prediction and word translation ranking, along with the widely used translation ranking. We evaluate our approach through experiments on the bitext retrieval task, which demonstrate substantial improvements on sentence embeddings in low-resource languages. In addition, the competitive performance of the proposed model across a broader range of tasks in high-resource languages underscores its practicality.
format	Preprint
id	arxiv_https___arxiv_org_abs_2404_02490
institution	arXiv
publishDate	2024
record_format	arxiv
title	Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment
topic	Computation and Language
url	https://arxiv.org/abs/2404.02490
