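Staff View: Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms :: Library Catalog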

Bibliographic Details
Main Authors:	Lee, Wonkee; Heo, Seong-Hwan; Lee, Jong-Hyeok
Format:	Preprint
Published:	2022
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2204.03896

_version_	1866910468122083328
author	Lee, Wonkee; Heo, Seong-Hwan; Lee, Jong-Hyeok
author_facet	Lee, Wonkee; Heo, Seong-Hwan; Lee, Jong-Hyeok
contents	Semi-supervised learning that leverages synthetic data for training has been widely adopted for developing automatic post-editing (APE) models due to the lack of training data. With this aim, we focus on data-synthesis methods to create high-quality synthetic data. Given that APE takes as input a machine-translation result that might include errors, we present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data. We introduce a noising-based data-synthesis method by adapting the masked language model approach, generating a noisy text from a clean text by infilling masked tokens with erroneous tokens. Moreover, we propose selective corpus interleaving that combines two separate synthetic datasets by taking only the advantageous samples to enhance the quality of the synthetic data further. Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.
format	Preprint
id	arxiv_https___arxiv_org_abs_2204_03896
institution	arXiv
publishDate	2022
record_format	arxiv
title	Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms
topic	Computation and Language
url	https://arxiv.org/abs/2204.03896
