Bibliographic Details
Main Authors: Fan, Yuchen; Lin, Chen; Zhong, Xin; Zhang, Shuo; Zhou, Heng; Zhang, Yuchen; Liang, Mingyu; Xie, Chengxing; Hua, Ermo; Chen, Gang; He, Zhizhou; Huang, Cheng; Ding, Ning; Zhou, Bowen
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2410.01945
_version_ 1866910006876569600
author Fan, Yuchen
Lin, Chen
Zhong, Xin
Zhang, Shuo
Zhou, Heng
Zhang, Yuchen
Liang, Mingyu
Xie, Chengxing
Hua, Ermo
Chen, Gang
He, Zhizhou
Huang, Cheng
Ding, Ning
Zhou, Bowen
contents Long-Form Question Answering (LFQA) involves generating comprehensive, paragraph-level responses to open-ended questions, which poses a significant challenge for evaluation due to the richness of information and flexible response format. Existing LFQA-evaluation benchmarks often lack reference answers and are limited in size and topic coverage, reducing their reliability. To address this gap, we introduce LFQA-E, a well-constructed, multilingual, and reference-based benchmark designed to rigorously evaluate automatic metrics for LFQA. LFQA-E comprises 1618 questions and 7323 pairwise comparisons across 15 topics, drawn from diverse sources such as online queries and examination questions, thereby enabling a comprehensive assessment of evaluation metrics. We examine five categories of metrics, encompassing 17 specific methods, using LFQA-E. The results demonstrate that none of the existing automatic metrics perform comparably to human judgments, highlighting their inability to capture the dense information in long-form responses. Furthermore, we present a detailed analysis of the failure cases and the generalization capacity of these metrics, offering insights to guide the future development of LFQA evaluation methods. The benchmark and code are available at https://github.com/YuchenFan48/LFQA-E.
format Preprint
id arxiv_https___arxiv_org_abs_2410_01945
institution arXiv
publishDate 2024
record_format arxiv
title LFQA-E: Carefully Benchmarking Long-form QA Evaluation
topic Computation and Language
url https://arxiv.org/abs/2410.01945