Bibliographic Details
Main authors: Shi, Yi; Wang, Congyi; Chen, Yu; Wang, Bin
Format: Preprint
Published: 2021
Subjects: Computation and Language; Artificial Intelligence
Online access: https://arxiv.org/abs/2102.00621
author Shi, Yi
Wang, Congyi
Chen, Yu
Wang, Bin
contents The majority of Chinese characters are monophonic, while a special group of characters, called polyphonic characters, have multiple pronunciations. As a prerequisite for speech-related generative tasks, the correct pronunciation must be identified among several candidates. This process is called polyphone disambiguation. Although the problem has been well explored with both knowledge-based and learning-based approaches, it remains challenging due to the lack of publicly available labeled datasets and the irregular nature of polyphones in Mandarin Chinese. In this paper, we propose a novel semi-supervised learning (SSL) framework for Mandarin Chinese polyphone disambiguation that can potentially leverage unlimited unlabeled text data. We explore the effect of various proxy labeling strategies, including entropy thresholding and lexicon-based labeling. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art performance. In addition, we publish a novel dataset specifically for the polyphone disambiguation task to promote further research.
format Preprint
id arxiv_https___arxiv_org_abs_2102_00621
institution arXiv
publishDate 2021
record_format arxiv
title Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised Learning
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2102.00621
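
The abstract names entropy thresholding as one proxy labeling strategy but this record gives no detail on how the authors implement it. The following is only a minimal sketch of the general idea, under stated assumptions: a model outputs a probability for each candidate pronunciation of a polyphonic character, and an unlabeled example receives a proxy label only when the entropy of that distribution falls below a threshold. The function names, the threshold value 0.3, and the pinyin strings are hypothetical and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def proxy_label(probs, candidates, max_entropy=0.3):
    """Return the most probable candidate if the prediction is confident.

    probs       -- predicted probabilities, one per candidate pronunciation
    candidates  -- candidate pronunciations (hypothetical pinyin strings)
    max_entropy -- assumed threshold; predictions above it stay unlabeled
    """
    if len(probs) != len(candidates):
        raise ValueError("probs and candidates must have the same length")
    if entropy(probs) > max_entropy:
        return None  # too uncertain: keep the example unlabeled
    best = max(range(len(probs)), key=probs.__getitem__)
    return candidates[best]


# Two hypothetical predictions for a character with two readings.
confident = proxy_label([0.97, 0.03], ["xing2", "hang2"])
uncertain = proxy_label([0.60, 0.40], ["xing2", "hang2"])
```

In this sketch the first prediction is sharp enough to be kept as a proxy label, while the second is left unlabeled. A lower threshold admits fewer but cleaner proxy labels, and a higher one admits more examples at the cost of more label noise.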