Staff View: Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text :: Library Catalog


Bibliographic Details
Main Authors:	Zhou, Chaochao; Yang, Bo
Format:	Preprint
Published:	2022
Subjects:	Information Retrieval; Machine Learning
Online Access:	https://arxiv.org/abs/2212.09044

_version_	1866918066869239808
author	Zhou, Chaochao; Yang, Bo
author_facet	Zhou, Chaochao; Yang, Bo
contents	Many analysis and prediction tasks require the extraction of structured data from unstructured text. However, no annotation scheme or training dataset has been available for training machine learning models to mine structured data from text without special templates and patterns. To address this gap, this paper presents an end-to-end machine learning pipeline, Text2Struct, comprising a text annotation scheme, training data processing, and a machine learning implementation. We formulate the mining problem as the extraction of metrics and units associated with numerals in the text. Text2Struct was trained and evaluated on an annotated text dataset collected from abstracts of medical publications on thrombectomy. In terms of prediction performance, a Dice coefficient of 0.82 was achieved on the test dataset. In a random sample, most predicted relations between numerals and entities matched the ground-truth annotations well. These results show that Text2Struct is viable for mining structured data from text without special templates or patterns. We anticipate that the pipeline can be further improved by expanding the dataset and investigating other machine learning models. A code demonstration can be found at: https://github.com/zcc861007/Text2Struct
format	Preprint
id	arxiv_https___arxiv_org_abs_2212_09044
institution	arXiv
publishDate	2022
record_format	arxiv
title	Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text
topic	Information Retrieval; Machine Learning
url	https://arxiv.org/abs/2212.09044
