Staff View: Leveraging Large Language Models for Generating Labeled Mineral Site Record Linkage Data :: Library Catalog

Bibliographic Details
Main Authors:	Pyo, Jiyoon; Chiang, Yao-Yi
Format:	Preprint
Published:	2024
Subjects:	Information Retrieval; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2412.03575

_version_	1866915049181806592
author	Pyo, Jiyoon; Chiang, Yao-Yi
author_facet	Pyo, Jiyoon; Chiang, Yao-Yi
contents	Record linkage integrates diverse data sources by identifying records that refer to the same entity. In the context of mineral site records, accurate record linkage is crucial for identifying and mapping mineral deposits. Properly linking records that refer to the same mineral deposit helps define the spatial coverage of mineral areas, benefiting resource identification and site data archiving. Mineral site record linkage falls under the spatial record linkage category since the records contain information about the physical locations and non-spatial attributes in a tabular format. The task is particularly challenging due to the heterogeneity and vast scale of the data. While prior research employs pre-trained discriminative language models (PLMs) on spatial entity linkage, they often require substantial amounts of curated ground-truth data for fine-tuning. Gathering and creating ground truth data is both time-consuming and costly. Therefore, such approaches are not always feasible in real-world scenarios where gold-standard data are unavailable. Although large generative language models (LLMs) have shown promising results in various natural language processing tasks, including record linkage, their high inference time and resource demand present challenges. We propose a method that leverages an LLM to generate training data and fine-tune a PLM to address the training data gap while preserving the efficiency of PLMs. Our approach achieves over 45% improvement in F1 score for record linkage compared to traditional PLM-based methods using ground truth data while reducing the inference time by nearly 18 times compared to relying on LLMs. Additionally, we offer an automated pipeline that eliminates the need for human intervention, highlighting this approach's potential to overcome record linkage challenges.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_03575
institution	arXiv
publishDate	2024
record_format	arxiv
title	Leveraging Large Language Models for Generating Labeled Mineral Site Record Linkage Data
topic	Information Retrieval; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2412.03575

Similar Items