Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Bertram, Johannes; Geiping, Jonas
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Software Engineering
Online Access:	https://arxiv.org/abs/2602.16756

_version_	1866914338330116096
author	Bertram, Johannes; Geiping, Jonas
author_facet	Bertram, Johannes; Geiping, Jonas
contents	We introduce NESSiE, the NEceSsary SafEty benchmark for large language models (LLMs). With minimal test cases of information and access security, NESSiE reveals safety-relevant failures that should not exist, given the low complexity of the tasks. NESSiE is intended as a lightweight, easy-to-use sanity check for language model safety and, as such, is not sufficient for guaranteeing safety in general -- but we argue that passing this test is necessary for any deployment. However, even state-of-the-art LLMs do not reach 100% on NESSiE and thus fail our necessary condition of language model safety, even in the absence of adversarial attacks. Our Safe & Helpful (SH) metric allows for direct comparison of the two requirements, showing models are biased toward being helpful rather than safe. We further find that disabled reasoning for some models, but especially a benign distraction context degrade model performance. Overall, our results underscore the critical risks of deploying such models as autonomous agents in the wild. We make the dataset, package and plotting code publicly available.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_16756
institution	arXiv
publishDate	2026
record_format	arxiv
title	NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist
topic	Cryptography and Security; Software Engineering
url	https://arxiv.org/abs/2602.16756

Similar Items