Bibliographic Details
Main Authors: Ghorbanfekr, Hossein; Kerstens, Pieter Jan; Dirix, Katrijn
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2407.10991
Abstract:
  • Geological borehole descriptions contain detailed textual information about the composition of the subsurface. However, their unstructured format makes it difficult to extract relevant features into a structured form. This paper introduces GEOBERTje: a domain-adapted large language model trained on Dutch-language geological borehole descriptions from Flanders (Belgium). The model extracts relevant information from the borehole descriptions and represents it in a numeric vector space. To showcase one potential application of GEOBERTje, we fine-tune a classifier on a limited number of manually labeled observations. This classifier categorizes borehole descriptions into a main, second, and third lithology class. We show that our classifier outperforms both a rule-based approach and OpenAI's GPT-4. This study exemplifies how domain-adapted large language models improve the efficiency and accuracy of extracting information from complex, unstructured geological descriptions, offering new opportunities for geological analysis and modeling using vast amounts of data.