Staff View: HySem: A context length optimized LLM pipeline for unstructured tabular extraction :: Library Catalog

Bibliographic Details
Main Authors:	PP, Narayanan; Iyer, Anantharaman Palacode Narayana
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; F.2.2; I.2.7
Online Access:	https://arxiv.org/abs/2408.09434

_version_	1866929529185894400
author	PP, Narayanan; Iyer, Anantharaman Palacode Narayana
author_facet	PP, Narayanan; Iyer, Anantharaman Palacode Narayana
contents	Regulatory compliance reporting in the pharmaceutical industry relies on detailed tables, but these are often under-utilized beyond compliance due to their unstructured format and arbitrary content. Extracting and semantically representing tabular data is challenging due to diverse table presentations. Large Language Models (LLMs) demonstrate substantial potential for semantic representation, yet they encounter challenges related to accuracy and context size limitations, which are crucial considerations for industry applications. We introduce HySem, a pipeline that employs a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. This approach utilizes a custom fine-tuned model specifically designed for cost- and privacy-sensitive small and medium pharmaceutical enterprises. Running on commodity hardware and leveraging open-source models, HySem surpasses its peer open-source models in accuracy and provides competitive performance when benchmarked against OpenAI GPT-4o. It also effectively addresses context length limitations, which is a crucial factor for supporting larger tables.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_09434
institution	arXiv
publishDate	2024
record_format	arxiv
title	HySem: A context length optimized LLM pipeline for unstructured tabular extraction
topic	Computation and Language; Artificial Intelligence; F.2.2; I.2.7
url	https://arxiv.org/abs/2408.09434

Similar Items