Bibliographic Details
Main Authors: Rayo, Jhon; de la Rosa, Raul; Garrido, Mario
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2502.16767
_version_ 1866916627214237696
author Rayo, Jhon
de la Rosa, Raul
Garrido, Mario
contents Regulatory texts are inherently long and complex, presenting significant challenges for information retrieval systems in supporting regulatory officers with compliance tasks. This paper introduces a hybrid information retrieval system that combines lexical and semantic search techniques to extract relevant information from large regulatory corpora. The system integrates a fine-tuned sentence transformer model with the traditional BM25 algorithm to achieve both semantic precision and lexical coverage. To generate accurate and comprehensive responses, retrieved passages are synthesized using Large Language Models (LLMs) within a Retrieval Augmented Generation (RAG) framework. Experimental results demonstrate that the hybrid system significantly outperforms standalone lexical and semantic approaches, with notable improvements in Recall@10 and MAP@10. By openly sharing our fine-tuned model and methodology, we aim to advance the development of robust natural language processing tools for compliance-driven applications in regulatory domains.
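The hybrid retrieval idea in the abstract can be illustrated with a small sketch. This record does not state how the lexical and semantic scores are combined, so the code below assumes one common choice: min-max normalize each score list and take a weighted sum. The weight `alpha`, the function names, and the toy two-dimensional vectors are all hypothetical; in a real system the vectors would come from the fine-tuned sentence transformer, which is not loaded here.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use a proper analyzer.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query (lexical signal)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two dense vectors (semantic signal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query, docs, query_vec, doc_vecs, alpha=0.65):
    """Rank documents by alpha * semantic + (1 - alpha) * lexical.

    alpha is a hypothetical weight chosen for illustration only.
    """
    lexical = min_max(bm25_scores(query, docs))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    ranking = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
    return ranking, fused


if __name__ == "__main__":
    docs = [
        "banks must report capital ratios quarterly",
        "firms shall retain client records for five years",
        "the weather was sunny",
    ]
    # Toy embeddings standing in for sentence-transformer output.
    doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.0, 1.0]]
    ranking, fused = hybrid_rank(
        "capital reporting requirements for banks", docs, [1.0, 0.0], doc_vecs
    )
    for i in ranking:
        print(round(fused[i], 3), docs[i])
```

Normalizing before fusing matters because BM25 scores are unbounded while cosine similarity lies in [-1, 1]; without it, one signal would dominate the sum regardless of `alpha`.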
format Preprint
id arxiv_https___arxiv_org_abs_2502_16767
institution arXiv
publishDate 2025
record_format arxiv
title A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts
topic Computation and Language
url https://arxiv.org/abs/2502.16767