Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Uluoglakci, Cem; Temizel, Tugba Taskaya
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2402.16211

_version_	1866914692254924800
author	Uluoglakci, Cem; Temizel, Tugba Taskaya
author_facet	Uluoglakci, Cem; Temizel, Tugba Taskaya
contents	Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs), limiting their widespread acceptance beyond chatbot applications. Despite ongoing efforts, hallucinations remain a prevalent challenge in LLMs. The detection of hallucinations itself is also a formidable task, frequently requiring manual labeling or constrained evaluations. This paper introduces an automated scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection. We leverage LLMs to generate challenging tasks related to hypothetical phenomena, subsequently employing them as agents for efficient hallucination detection. The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain. We introduce the publicly available HypoTermQA Benchmarking Dataset, on which state-of-the-art models' performance ranged between 3% and 11%, and evaluator agents demonstrated a 6% error rate in hallucination prediction. The proposed framework provides opportunities to test and improve LLMs. Additionally, it has the potential to generate benchmarking datasets tailored to specific domains, such as law, health, and finance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_16211
institution	arXiv
publishDate	2024
record_format	arxiv
title	HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs
topic	Computation and Language
url	https://arxiv.org/abs/2402.16211
