Staff View: AttributionBench: How Hard is Automatic Attribution Evaluation? :: Library Catalog

Bibliographic Details
Main Authors:	Li, Yifei; Yue, Xiang; Liao, Zeyi; Sun, Huan
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2402.15089

_version_	1866916135599865856
author	Li, Yifei; Yue, Xiang; Liao, Zeyi; Sun, Huan
author_facet	Li, Yifei; Yue, Xiang; Liao, Zeyi; Sun, Huan
contents	Modern generative search engines enhance the reliability of large language model (LLM) responses by providing cited evidence. However, evaluating the answer's attribution, i.e., whether every claim within the generated responses is fully supported by its cited evidence, remains an open problem. This verification, traditionally dependent on costly human evaluation, underscores the urgent need for automatic attribution evaluation methods. To bridge the gap in the absence of standardized benchmarks for these methods, we present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets. Our extensive experiments on AttributionBench reveal the challenges of automatic attribution evaluation, even for state-of-the-art LLMs. Specifically, our findings show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation. A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information, and the discrepancy between the information the model has access to and that human annotators do.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_15089
institution	arXiv
publishDate	2024
record_format	arxiv
title	AttributionBench: How Hard is Automatic Attribution Evaluation?
topic	Computation and Language; Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2402.15089

Similar Items