Staff View: LangMark: A Multilingual Dataset for Automatic Post-Editing :: Library Catalog

Bibliographic Details
Main Authors:	Velazquez, Diego; Grace, Mikaela; Karageorgos, Konstantinos; Carin, Lawrence; Schliem, Aaron; Zaikis, Dimitrios; Wechsler, Roger
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2511.17153

_version_	1866918213234720768
author	Velazquez, Diego; Grace, Mikaela; Karageorgos, Konstantinos; Carin, Lawrence; Schliem, Aaron; Zaikis, Dimitrios; Wechsler, Roger
author_facet	Velazquez, Diego; Grace, Mikaela; Karageorgos, Konstantinos; Carin, Lawrence; Schliem, Aaron; Zaikis, Dimitrios; Wechsler, Roger
contents	Automatic post-editing (APE) aims to correct errors in machine-translated text, enhancing translation quality, while reducing the need for human intervention. Despite advances in neural machine translation (NMT), the development of effective APE systems has been hindered by the lack of large-scale multilingual datasets specifically tailored to NMT outputs. To address this gap, we present and release LangMark, a new human-annotated multilingual APE dataset for English translation to seven languages: Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish. The dataset has 206,983 triplets, with each triplet consisting of a source segment, its NMT output, and a human post-edited translation. Annotated by expert human linguists, our dataset offers both linguistic diversity and scale. Leveraging this dataset, we empirically show that Large Language Models (LLMs) with few-shot prompting can effectively perform APE, improving upon leading commercial and even proprietary machine translation systems. We believe that this new resource will facilitate the future development and evaluation of APE systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_17153
institution	arXiv
publishDate	2025
record_format	arxiv
title	LangMark: A Multilingual Dataset for Automatic Post-Editing
topic	Computation and Language
url	https://arxiv.org/abs/2511.17153

Similar Items