Bibliographic Details
Main Authors: Wang, Yuxia; Reddy, Revanth Gangi; Mujahid, Zain Muhammad; Arora, Arnav; Rubashevskii, Aleksandr; Geng, Jiahui; Afzal, Osama Mohammed; Pan, Liangming; Borenstein, Nadav; Pillai, Aditya; Augenstein, Isabelle; Gurevych, Iryna; Nakov, Preslav
Format: Preprint
Published: 2023
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2311.09000
_version_ 1866913316356489216
author Wang, Yuxia
Reddy, Revanth Gangi
Mujahid, Zain Muhammad
Arora, Arnav
Rubashevskii, Aleksandr
Geng, Jiahui
Afzal, Osama Mohammed
Pan, Liangming
Borenstein, Nadav
Pillai, Aditya
Augenstein, Isabelle
Gurevych, Iryna
Nakov, Preslav
contents The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, built around a multi-stage annotation scheme designed to yield detailed labels for the verifiability and factual inconsistencies of LLM outputs. We further construct an open-domain document-level factuality benchmark at three levels of granularity (claim, sentence, and document) to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore, and Perplexity.ai struggle to identify false claims; the best F1 of 0.63 is achieved by this annotation solution based on GPT-4. The annotation tool, benchmark, and code are available at https://github.com/yuxiaw/Factcheck-GPT.
format Preprint
id arxiv_https___arxiv_org_abs_2311_09000
institution arXiv
publishDate 2023
record_format arxiv
title Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers
topic Computation and Language
url https://arxiv.org/abs/2311.09000