Bibliographic Details
Main Author: Priola, Maria Paola
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2412.04235
_version_ 1866911345589354496
author Priola, Maria Paola
author_facet Priola, Maria Paola
contents I combine detection and mitigation techniques to address hallucinations in Large Language Models (LLMs). Mitigation is achieved in a question-answering Retrieval-Augmented Generation (RAG) framework, while detection is obtained by introducing the Negative Missing Information Scoring System (NMISS), which accounts for contextual relevance in responses. While RAG mitigates hallucinations by grounding answers in external data, NMISS refines the evaluation by identifying cases where traditional metrics incorrectly flag contextually accurate responses as hallucinations. I use Italian health news articles as context to evaluate LLM performance. Results show that Gemma2 and GPT-4 outperform the other models, with GPT-4 producing answers closely aligned with reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral, benefit significantly from NMISS, highlighting their ability to provide richer contextual information. This combined approach offers new insights into the reduction and more accurate assessment of hallucinations in LLMs, with applications in real-world healthcare tasks and other domains.
format Preprint
id arxiv_https___arxiv_org_abs_2412_04235
institution arXiv
publishDate 2024
record_format arxiv
title Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots
topic Computation and Language
url https://arxiv.org/abs/2412.04235