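Staff View: Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents :: Library Catalog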

Bibliographic Details
Main Authors:	Taiwo, Samuel; Yusoff, Mohd Amaluddin
Format:	Preprint
Published:	2026
Subjects:	Information Retrieval; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.24556

_version_	1866912982331555840
author	Taiwo, Samuel; Yusoff, Mohd Amaluddin
author_facet	Taiwo, Samuel; Yusoff, Mohd Amaluddin
contents	Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking - an often-overlooked determinant of its quality. This paper presents an empirical study quantifying performance differences across four chunking strategies: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. We evaluated these methods using a proprietary corpus of oil and gas enterprise documents, including text-heavy manuals, table-heavy specifications, and piping and instrumentation diagrams (P&IDs). Our findings show that structure-aware chunking yields higher overall retrieval effectiveness, particularly in top-K metrics, and incurs significantly lower computational costs than semantic or baseline strategies. Crucially, all four methods demonstrated limited effectiveness on P&IDs, underscoring a core limitation of purely text-based RAG within visually and spatially encoded documents. We conclude that while explicit structure preservation is essential for specialised domains, future work must integrate multimodal models to overcome current limitations.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_24556
institution	arXiv
publishDate	2026
record_format	arxiv
title	Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents
topic	Information Retrieval; Artificial Intelligence
url	https://arxiv.org/abs/2603.24556
