Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Gugg, Regina; Niederländer, Selina; Stöckl, Andreas; Flechl, Martin
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.10639

_version_	1866916000894550016
author	Gugg, Regina; Niederländer, Selina; Stöckl, Andreas; Flechl, Martin
author_facet	Gugg, Regina; Niederländer, Selina; Stöckl, Andreas; Flechl, Martin
contents	The rapid adoption of LLMs in both research and industry highlights the challenges of deploying them safely and reveals a gap in the systematic evaluation of toxicity benchmarks. As organizations increasingly rely on these benchmarks to certify models for customer-facing applications and automated moderation, unrecognized evaluation biases could lead to the deployment of vulnerable or unsafe systems. This work investigates the robustness of established benchmarking setups and examines how to measure currently neglected intrinsic biases, such as those related to model choice, metrics, and task types. Our experiments uncover significant discrepancies in benchmark behaviors when evaluation setups are altered. Specifically, shifting the task from text completion to summarization increases the tendency of benchmarks to flag content as harmful. Additionally, certain benchmarks fail to maintain consistent behavior when the input data domain is changed. Furthermore, we observe model-specific instabilities, demonstrating a clear need for more robust and comprehensive safety evaluation frameworks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_10639
institution	arXiv
publishDate	2026
record_format	arxiv
title	Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.10639
