MARC21: :: Library Catalog

Bibliographic Details
Main Author:	Broadwater, Keita
Format:	Preprint
Published:	2026
Subjects:	Machine Learning Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.11786

_version_	1866913068756238336
author	Broadwater, Keita
author_facet	Broadwater, Keita
contents	Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories. However, real-world deployment often exposes a different class of risk: operational failures that arise under repeated inference on identical or near-identical prompts rather than from broad task-level underperformance. In high-stakes settings, response consistency and safety under sustained use are therefore critical. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by highly accelerated stress testing in reliability engineering. APST repeatedly samples identical prompts under controlled operational conditions (such as decoding temperature) to surface latent failure modes including hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST models them as stochastic outcomes of repeated inference and uses Bernoulli and binomial formulations to estimate per-inference failure probabilities. Applying APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH 2024-derived safety and security prompts, we find that models with comparable shallow-evaluation scores can exhibit substantially different empirical failure rates under repeated sampling. These results show that single-sample or low-depth evaluation can obscure meaningful differences in deployment-relevant reliability. APST complements existing benchmark methodologies by providing a practical framework for estimating failure frequency under sustained use and comparing safety reliability across models and decoding configurations.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_11786
institution	arXiv
publishDate	2026
record_format	arxiv
title	Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing
topic	Machine Learning Artificial Intelligence
url	https://arxiv.org/abs/2602.11786
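
The abstract in the contents field says that APST treats failures as stochastic outcomes of repeated inference and uses Bernoulli and binomial formulations to estimate per-inference failure probabilities. The record gives no formulas, so the following is only a minimal sketch of what such an estimate could look like, assuming independent and identically distributed trials. The function names, the choice of a Wilson score interval, and the "at least one failure in n calls" projection are illustrative assumptions, not the paper's stated method.

```python
import math


def failure_rate(failures: int, trials: int) -> float:
    """Maximum-likelihood estimate of the per-inference failure probability.

    Assumes each repeated sample of the same prompt is an independent
    Bernoulli trial, so the failure count is binomially distributed.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= failures <= trials:
        raise ValueError("failures must lie between 0 and trials")
    return failures / trials


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%).

    Chosen here for illustration because it behaves sensibly when the
    observed failure count is zero or very small.
    """
    p_hat = failure_rate(failures, trials)
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def prob_at_least_one_failure(p: float, n: int) -> float:
    """Probability of one or more failures in n independent inferences."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not results from the paper.
    failures, trials = 3, 200
    low, high = wilson_interval(failures, trials)
    print(f"estimated failure rate: {failure_rate(failures, trials):.4f}")
    print(f"approx. 95% interval: [{low:.4f}, {high:.4f}]")
    print(f"P(>=1 failure in 1000 calls): "
          f"{prob_at_least_one_failure(failure_rate(failures, trials), 1000):.4f}")
```

Under these assumptions, even a small per-inference failure probability compounds quickly over many calls, which is one way to read the abstract's claim that single-sample evaluation can hide differences in deployment-relevant reliability.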
