Internformat: :: Library Catalog

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Movva, Rajiv, Koh, Pang Wei, Pierson, Emma
Format:	Preprint
Veröffentlicht:	2024
Schlagworte:	Computation and Language
Online-Zugang:	https://arxiv.org/abs/2406.06369
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

_version_	1866913535259312128
author	Movva, Rajiv Koh, Pang Wei Pierson, Emma
author_facet	Movva, Rajiv Koh, Pang Wei Pierson, Emma
contents	Do LLMs align with human perceptions of safety? We study this question via annotation alignment, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, \textit{higher} than the median annotator's correlation with the average ($r=0.51$). We show that larger datasets are needed to resolve whether LLMs exhibit disparities in how well they correlate with different demographic groups. Also, there is substantial idiosyncratic variation in correlation within groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_06369
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Annotation alignment: Comparing LLM and human annotations of conversational safety Movva, Rajiv Koh, Pang Wei Pierson, Emma Computation and Language Do LLMs align with human perceptions of safety? We study this question via annotation alignment, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, \textit{higher} than the median annotator's correlation with the average ($r=0.51$). We show that larger datasets are needed to resolve whether LLMs exhibit disparities in how well they correlate with different demographic groups. Also, there is substantial idiosyncratic variation in correlation within groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
title	Annotation alignment: Comparing LLM and human annotations of conversational safety
topic	Computation and Language
url	https://arxiv.org/abs/2406.06369

Ähnliche Einträge