Staff View: A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts :: Library Catalog

Bibliographic Details
Main Authors:	Rayo, Jhon; de la Rosa, Raul; Garrido, Mario
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.16767

_version_	1866916627214237696
author	Rayo, Jhon; de la Rosa, Raul; Garrido, Mario
author_facet	Rayo, Jhon; de la Rosa, Raul; Garrido, Mario
contents	Regulatory texts are inherently long and complex, presenting significant challenges for information retrieval systems in supporting regulatory officers with compliance tasks. This paper introduces a hybrid information retrieval system that combines lexical and semantic search techniques to extract relevant information from large regulatory corpora. The system integrates a fine-tuned sentence transformer model with the traditional BM25 algorithm to achieve both semantic precision and lexical coverage. To generate accurate and comprehensive responses, retrieved passages are synthesized using Large Language Models (LLMs) within a Retrieval Augmented Generation (RAG) framework. Experimental results demonstrate that the hybrid system significantly outperforms standalone lexical and semantic approaches, with notable improvements in Recall@10 and MAP@10. By openly sharing our fine-tuned model and methodology, we aim to advance the development of robust natural language processing tools for compliance-driven applications in regulatory domains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_16767
institution	arXiv
publishDate	2025
record_format	arxiv
title	A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts
topic	Computation and Language
url	https://arxiv.org/abs/2502.16767
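
Illustrative note: the abstract in the contents field describes combining BM25 (lexical) scores with sentence-transformer (semantic) scores and evaluating with Recall@10 and MAP@10, but the record does not say how the two scores are fused. The sketch below is one common way to do this, not the authors' method. It assumes min-max normalization followed by a weighted sum; the names min_max, hybrid_rank, and alpha are hypothetical, and the input scores stand in for the output of any BM25 implementation and any embedding model. MAP@10 would be the mean of average_precision_at_k over all queries.

```python
from typing import Dict, List, Set


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score map becomes all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(lexical: Dict[str, float],
                semantic: Dict[str, float],
                alpha: float = 0.5) -> List[str]:
    """Rank passages by alpha * semantic + (1 - alpha) * lexical.

    ASSUMPTION: weighted-sum fusion with a hypothetical weight `alpha`;
    the catalog record does not state the fusion rule used in the paper.
    """
    lex, sem = min_max(lexical), min_max(semantic)
    docs = set(lex) | set(sem)
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * lex.get(d, 0.0)
             for d in docs}
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(docs, key=lambda d: (-fused[d], d))


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of relevant passages that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision_at_k(ranked: List[str], relevant: Set[str],
                           k: int = 10) -> float:
    """Average precision over the top k for one query.

    ASSUMPTION: normalized by min(len(relevant), k), one common convention.
    """
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), k)


if __name__ == "__main__":
    # Toy scores for one query; passage ids and values are made up.
    lexical = {"a": 2.0, "b": 1.0, "c": 0.0}
    semantic = {"a": 0.0, "b": 0.9, "d": 1.0}
    ranking = hybrid_rank(lexical, semantic, alpha=0.5)
    relevant = {"b", "d"}
    print(ranking)
    print(recall_at_k(ranking, relevant, k=10))
    print(average_precision_at_k(ranking, relevant, k=10))
```

A passage that only one retriever returns is scored 0 for the other one, so it can still surface if its single score is strong enough. That is the usual motivation for fusing lexical and semantic results.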

Similar Items