Staff View: Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings :: Library Catalog

Bibliographic Details
Main Authors:	Hackl, Veronika; Müller, Alexandra Elena; Granitzer, Michael; Sailer, Maximilian
Format:	Preprint
Published:	2023
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2308.02575

_version_	1866911760021192704
author	Hackl, Veronika; Müller, Alexandra Elena; Granitzer, Michael; Sailer, Maximilian
author_facet	Hackl, Veronika; Müller, Alexandra Elena; Granitzer, Michael; Sailer, Maximilian
contents	This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4, a state-of-the-art artificial intelligence language model, across multiple iterations, time spans, and stylistic variations. The model rated responses to tasks within the Higher Education (HE) subject domain of macroeconomics in terms of their content and style. Statistical analysis was conducted to assess the interrater reliability, the consistency of the ratings across iterations, and the correlation between content and style ratings. The results revealed high interrater reliability, with ICC scores ranging between 0.94 and 0.99 for different time spans, suggesting that GPT-4 is capable of generating consistent ratings across repetitions when given a clear prompt. Style and content ratings show a high correlation of 0.87. When an inadequate style was applied, the average content ratings remained constant while style ratings decreased, which indicates that the large language model (LLM) effectively distinguishes between these two criteria during evaluation. The prompt used in this study is also presented and explained. Further research is necessary to assess the robustness and reliability of AI models in various use cases.
format	Preprint
id	arxiv_https___arxiv_org_abs_2308_02575
institution	arXiv
publishDate	2023
record_format	arxiv
title	Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2308.02575

Similar Items