MARC View: :: Library Catalog


Bibliographic Details
Main Authors:	Sadeghi, Zahra; Milios, Evangelos; Rudzicz, Frank
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.19620

_version_	1866908728356241408
author	Sadeghi, Zahra; Milios, Evangelos; Rudzicz, Frank
author_facet	Sadeghi, Zahra; Milios, Evangelos; Rudzicz, Frank
contents	Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, which requires a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges that evaluate summaries with respect to the original text. While previous research has investigated the alignment between LLM and human responses, it is not yet well understood which properties or features they exploit when asked to evaluate a summary along a particular quality dimension, and little attention has been paid to the mapping between evaluation scores and metrics. In this paper, we address this issue and identify features aligned with human and Generative Pre-trained Transformer (GPT) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ the metrics used by humans can improve their judgment and bring it closer to human responses.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_19620
institution	arXiv
publishDate	2025
record_format	arxiv
title	Exploring the features used for summary evaluation by Human and GPT
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2512.19620
