Staff View: When Scanners Lie: Evaluator Instability in LLM Red-Teaming :: Library Catalog

Bibliographic Details
Main Authors:	Erez, Lidor; Hofman, Omer; Nizri, Tamir; Vainshtein, Roman
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Performance
Online Access:	https://arxiv.org/abs/2603.14633

_version_	1866910053963923456
author	Erez, Lidor; Hofman, Omer; Nizri, Tamir; Vainshtein, Roman
author_facet	Erez, Lidor; Hofman, Omer; Nizri, Tamir; Vainshtein, Roman
contents	Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring attack success rates (ASR) across attack types. Yet the validity of these measurements hinges on an often-overlooked component: the evaluator that determines whether an attack has succeeded. In this study, we demonstrate that commonly used open-source scanners exhibit measurement instability that depends on the evaluator component. Consequently, changing the evaluator while keeping the attacks and model outputs constant can significantly alter the reported ASR. To tackle this problem, we present a two-phase, reliability-aware evaluation framework. In the first phase, we quantify evaluator disagreement to identify attack categories where ASR reliability cannot be assumed. In the second phase, we propose a verification-based evaluation method in which evaluators are validated by an independent verifier, enabling reliability assessment without relying on extensive human annotation. Applied to the widely used Garak scanner, we observe that 22 of 25 attack categories exhibit evaluator instability, reflected in high disagreement among evaluators. Our approach raises evaluator accuracy from 72% to 89% while enabling selective deployment to control cost and computational overhead. We further quantify evaluator uncertainty in ASR estimates, showing that reported vulnerability scores can vary by up to 33% depending on the evaluator. Our results indicate that the outputs of vulnerability scanners are highly sensitive to the choice of evaluator. Our framework offers a practical approach to identifying unreliable evaluations and improving the reliability of measurements in automated LLM security assessments.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_14633
institution	arXiv
publishDate	2026
record_format	arxiv
title	When Scanners Lie: Evaluator Instability in LLM Red-Teaming
topic	Cryptography and Security; Performance
url	https://arxiv.org/abs/2603.14633