| Main Authors: | Ali, Wazir; Kumar, Jay; Tumrani, Saifullah; Nour, Redhwan; Noor, Adeeb; Xu, Zenglin |
|---|---|
| Format: | Preprint |
| Published: | 2020 |
| Subjects: | Computation and Language; Machine Learning |
| Online Access: | https://arxiv.org/abs/2012.15079 |
| _version_ | 1866913491098533888 |
|---|---|
| author | Ali, Wazir; Kumar, Jay; Tumrani, Saifullah; Nour, Redhwan; Noor, Adeeb; Xu, Zenglin |
| author_facet | Ali, Wazir; Kumar, Jay; Tumrani, Saifullah; Nour, Redhwan; Noor, Adeeb; Xu, Zenglin |
| contents | Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity. It's cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2012_15079 |
| institution | arXiv |
| publishDate | 2020 |
| record_format | arxiv |
| title | Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention |
| topic | Computation and Language; Machine Learning |
| url | https://arxiv.org/abs/2012.15079 |
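
The abstract above describes word segmentation as a sequence labeling task. The following minimal sketch illustrates that framing only: each character receives a boundary tag, and words are recovered from the tags. The B/I/E/S tag set, the function names `words_to_tags` and `tags_to_words`, and the Latin-script sample are assumptions made for illustration; the record does not state which tagging scheme the SGNWS model uses, and the neural components (BiLSTM encoder, self-attention, CRF) are not reproduced here.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Convert gold-standard words into one boundary tag per character.

    Assumed (hypothetical) tag set: B = begin, I = inside, E = end,
    S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty strings so no tag is emitted without a character
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from an unsegmented character string and its tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # flush a trailing word left open by an incomplete tag sequence
        words.append(current)
    return words


# Illustrative Latin-script input; real input would be Sindhi text.
gold_words = ["the", "cat", "a"]
tags = words_to_tags(gold_words)
restored = tags_to_words("".join(gold_words), tags)
```

In this framing, a trained sequence labeler would predict `tags` from the unsegmented characters, and `tags_to_words` would turn those predictions back into a segmentation.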