Saved in:
Bibliographic Details
Main Authors: Erez, Lidor, Hofman, Omer, Nizri, Tamir, Vainshtein, Roman
Format: Preprint
Published: 2026
Subjects:
Online Access:https://arxiv.org/abs/2603.14633
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866910053963923456
author Erez, Lidor
Hofman, Omer
Nizri, Tamir
Vainshtein, Roman
author_facet Erez, Lidor
Hofman, Omer
Nizri, Tamir
Vainshtein, Roman
contents Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring different attack type success rates (ASR). Yet the validity of these measurements hinges on an often-overlooked component: the evaluator who determines whether an attack has succeeded. In this study, we demonstrate that commonly used open-source scanners exhibit measurement instability that depends on the evaluator component. Consequently, changing the evaluator while keeping the attacks and model outputs constant can significantly alter the reported ASR. To tackle this problem, we present a two-phase, reliability-aware evaluation framework. In the first phase, we quantify evaluator disagreement to identify attack categories where ASR reliability cannot be assumed. In the second phase, we propose a verification-based evaluation method where evaluators are validated by an independent verifier, enabling reliability assessment without relying on extensive human annotation. Applied to the widely used Garak scanner, we observe that 22 of 25 attack categories exhibit evaluator instability, reflected in high disagreement among evaluators. Our approach raises evaluator accuracy from 72% to 89% while enabling selective deployment to control cost and computational overhead. We further quantify evaluator uncertainty in ASR estimates, showing that reported vulnerability scores can vary by up to 33% depending on the evaluator. Our results indicate that the outputs of vulnerability scanners are highly sensitive to the choice of evaluators. Our framework offers a practical approach to quantify unreliable evaluations and enhance the reliability of measurements in automated LLM security assessments.
format Preprint
id arxiv_https___arxiv_org_abs_2603_14633
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle When Scanners Lie: Evaluator Instability in LLM Red-Teaming
Erez, Lidor
Hofman, Omer
Nizri, Tamir
Vainshtein, Roman
Cryptography and Security
Performance
Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring different attack type success rates (ASR). Yet the validity of these measurements hinges on an often-overlooked component: the evaluator who determines whether an attack has succeeded. In this study, we demonstrate that commonly used open-source scanners exhibit measurement instability that depends on the evaluator component. Consequently, changing the evaluator while keeping the attacks and model outputs constant can significantly alter the reported ASR. To tackle this problem, we present a two-phase, reliability-aware evaluation framework. In the first phase, we quantify evaluator disagreement to identify attack categories where ASR reliability cannot be assumed. In the second phase, we propose a verification-based evaluation method where evaluators are validated by an independent verifier, enabling reliability assessment without relying on extensive human annotation. Applied to the widely used Garak scanner, we observe that 22 of 25 attack categories exhibit evaluator instability, reflected in high disagreement among evaluators. Our approach raises evaluator accuracy from 72% to 89% while enabling selective deployment to control cost and computational overhead. We further quantify evaluator uncertainty in ASR estimates, showing that reported vulnerability scores can vary by up to 33% depending on the evaluator. Our results indicate that the outputs of vulnerability scanners are highly sensitive to the choice of evaluators. Our framework offers a practical approach to quantify unreliable evaluations and enhance the reliability of measurements in automated LLM security assessments.
title When Scanners Lie: Evaluator Instability in LLM Red-Teaming
topic Cryptography and Security
Performance
url https://arxiv.org/abs/2603.14633