Bibliographic Details
Main Authors: Ali, Wazir; Kumar, Jay; Tumrani, Saifullah; Nour, Redhwan; Noor, Adeeb; Xu, Zenglin
Format: Preprint
Published: 2020
Subjects: Computation and Language; Machine Learning
Online Access: https://arxiv.org/abs/2012.15079
author Ali, Wazir
Kumar, Jay
Tumrani, Saifullah
Nour, Redhwan
Noor, Adeeb
Xu, Zenglin
contents Sindhi word segmentation is a challenging task because of space omission and space insertion issues. The Sindhi script adds to this complexity: it is cursive, and its characters have inherent joining and non-joining properties that are independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. These methods have limitations, including difficulty handling out-of-vocabulary words, limited robustness when applied to other languages, and inefficiency on large amounts of noisy or raw text. Neural network-based models, in contrast, can capture word boundary information automatically, without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that treats word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results show that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
format Preprint
id arxiv_https___arxiv_org_abs_2012_15079
institution arXiv
publishDate 2020
record_format arxiv
title Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2012.15079