Staff View: Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety :: Library Catalog

Bibliographic Details
Main Author:	Mitra, Subhadip
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Computation and Language; Emerging Technologies; Machine Learning; Neural and Evolutionary Computing
Online Access:	https://arxiv.org/abs/2606.00801

_version_	1866918533722537984
author	Mitra, Subhadip
author_facet	Mitra, Subhadip
contents	Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method, prompt length). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open-weight coding model (Devstral-small-2), we discover distinct vulnerability profiles: GPT-4o-mini is vulnerable to hypothetical and multi-turn framing combined with ROT13 encoding (fitness 0.8), Gemini to direct attacks with ROT13 and multi-turn with Leetspeak (0.8), while Claude shows uniformly ambiguous responses across all strategies (max 0.4). The semantic representation produces interpretable attacks that reveal systematic, model-specific weaknesses, providing actionable insights for improving LLM safety and a reproducible baseline for evaluating future frontier models. Code and experiment artifacts are released at https://github.com/bassrehab/red-queen.
format	Preprint
id	arxiv_https___arxiv_org_abs_2606_00801
institution	arXiv
publishDate	2026
record_format	arxiv
title	Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
topic	Cryptography and Security; Computation and Language; Emerging Technologies; Machine Learning; Neural and Evolutionary Computing
url	https://arxiv.org/abs/2606.00801
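
The contents field describes a MAP-Elites archive indexed by strategy type, encoding method, and prompt length. The sketch below illustrates that archive structure only and is not the code from the linked repository. The descriptor bins, the word-count thresholds, `toy_mutate`, and `toy_fitness` are all hypothetical placeholders. In particular, the fitness here is an arbitrary deterministic score, whereas a real run would presumably score a target model's response.

```python
import codecs
import random

# Hypothetical descriptor bins. The abstract names the three dimensions
# (strategy type, encoding method, prompt length) but not their exact values.
STRATEGIES = ["direct", "hypothetical", "multi_turn"]
ENCODINGS = ["plain", "rot13", "leetspeak"]
LEET = str.maketrans("aeiost", "431057")


def encode(text, encoding):
    """Apply one of the illustrative encodings to a prompt string."""
    if encoding == "rot13":
        return codecs.encode(text, "rot13")
    if encoding == "leetspeak":
        return text.lower().translate(LEET)
    return text


def length_bin(text):
    """Bucket a prompt by word count (thresholds are arbitrary assumptions)."""
    n = len(text.split())
    return "short" if n < 8 else "medium" if n < 20 else "long"


def toy_mutate(cand, rng):
    """Hypothetical mutation: change one semantic field of the candidate."""
    strategy, encoding, text = cand
    choice = rng.randrange(3)
    if choice == 0:
        strategy = rng.choice(STRATEGIES)
    elif choice == 1:
        encoding = rng.choice(ENCODINGS)
    else:
        text = text + " please elaborate"
    return (strategy, encoding, text)


def toy_fitness(cand):
    """Placeholder score in [0, 1]; stands in for judging a model's reply."""
    strategy, encoding, text = cand
    return ((len(strategy) * 7 + len(encoding) * 3 + len(text)) % 11) / 10


def map_elites(seeds, mutate, fitness, iterations, rng):
    """Keep the best-scoring candidate per (strategy, encoding, length) cell."""
    archive = {}

    def try_insert(cand):
        strategy, encoding, text = cand
        cell = (strategy, encoding, length_bin(text))
        score = fitness(cand)
        # Elitism: replace the occupant only if the newcomer scores higher.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, cand)

    for seed in seeds:
        try_insert(seed)
    for _ in range(iterations):
        # Pick a random elite as parent, mutate it, and try to place the child.
        parent = rng.choice(list(archive.values()))[1]
        try_insert(mutate(parent, rng))
    return archive


archive = map_elites(
    seeds=[("direct", "plain", "describe the system")],
    mutate=toy_mutate,
    fitness=toy_fitness,
    iterations=200,
    rng=random.Random(0),
)
for cell, (score, cand) in sorted(archive.items()):
    print(cell, score, encode(cand[2], cand[1]))
```

Because each cell keeps only its best candidate, the archive retains one elite per combination of descriptors instead of converging on a single high-scoring prompt, which is the coverage property the abstract contrasts with mode collapse.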

Similar Items