Staff View: TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks :: Library Catalog

Bibliographic Details
Main Authors:	Jiang, Dongfu; Li, Yishan; Zhang, Ge; Huang, Wenhao; Lin, Bill Yuchen; Chen, Wenhu
Format:	Preprint
Published:	2023
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2310.00752

_version_	1866910441026879488
author	Jiang, Dongfu; Li, Yishan; Zhang, Ge; Huang, Wenhao; Lin, Bill Yuchen; Chen, Wenhu
author_facet	Jiang, Dongfu; Li, Yishan; Zhang, Ge; Huang, Wenhao; Lin, Bill Yuchen; Chen, Wenhu
contents	We present TIGERScore, a Trained metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation over a wide spectrum of text generation tasks. Unlike other automatic evaluation methods that only provide arcane scores, TIGERScore is guided by natural language instruction to provide error analysis that pinpoints the mistakes in the generated text. Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples in the form of (instruction, input, system output → error analysis). We collected the 'system outputs' from a large variety of models to cover different types of errors. To quantitatively assess our metric, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, and show that TIGERScore achieves the open-source SoTA correlation with human ratings across these datasets and comes close to that of the GPT-4 evaluator. As a reference-free metric, its correlation can even surpass that of the best existing reference-based metrics. To further qualitatively assess the rationale generated by our metric, we conduct human evaluation on the generated explanations and find that the explanations are 70.8% accurate. Through these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task. All resources are released on our project website: https://tiger-ai-lab.github.io/TIGERScore/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2310_00752
institution	arXiv
publishDate	2023
record_format	arxiv
title	TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2310.00752

Similar Items