Bibliographic Details
Main Authors: Sadeghi, Zahra, Milios, Evangelos, Rudzicz, Frank
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2512.19620
_version_ 1866908728356241408
author Sadeghi, Zahra
Milios, Evangelos
Rudzicz, Frank
author_facet Sadeghi, Zahra
Milios, Evangelos
Rudzicz, Frank
contents Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, which requires a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges that evaluate summaries with respect to the original text. While previous research has investigated the alignment between LLM and human responses, it is not yet well understood which properties or features LLMs exploit when asked to evaluate along a particular quality dimension, and little attention has been paid to the mapping between evaluation scores and metrics. In this paper, we address this issue and identify features aligned with human and Generative Pre-trained Transformer (GPT) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ the metrics used by humans can improve their judgment and bring it into closer agreement with human responses.
format Preprint
id arxiv_https___arxiv_org_abs_2512_19620
institution arXiv
publishDate 2025
record_format arxiv
title Exploring the features used for summary evaluation by Human and GPT
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2512.19620