MARC21 :: Library Catalog

Bibliographic Details
Main Authors:	Mangold, Aline; Hoffmann, Kiran
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2509.26205

_version_	1866915525114724352
author	Mangold, Aline Hoffmann, Kiran
author_facet	Mangold, Aline Hoffmann, Kiran
contents	Retrieval-augmented generation (RAG) systems are increasingly deployed in user-facing applications, yet systematic, human-centered evaluation of their outputs remains underexplored. Building on Gienapp's utility-dimension framework, we designed a human-centered questionnaire that assesses RAG outputs across 12 dimensions. We iteratively refined the questionnaire through several rounds of ratings on a set of query-output pairs and semantic discussions. Ultimately, we incorporated feedback from both a human rater and a human-LLM pair. Results indicate that while large language models (LLMs) reliably focus on metric descriptions and scale labels, they exhibit weaknesses in detecting textual format variations. Humans struggled to focus strictly on metric descriptions and labels. LLM ratings and explanations were viewed as helpful support, but numeric LLM and human ratings lacked agreement. The final questionnaire extends the initial framework by focusing on user intent, text structuring, and information verifiability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_26205
institution	arXiv
publishDate	2025
record_format	arxiv
title	Human-Centered Evaluation of RAG outputs: a framework and questionnaire for human-AI collaboration
topic	Artificial Intelligence
url	https://arxiv.org/abs/2509.26205

Similar Items