Bibliographic Details
Main Author: Mitra, Subhadip
Format: Preprint
Published: 2026
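Subjects: Cryptography and Security; Computation and Language; Emerging Technologies; Machine Learning; Neural and Evolutionary Computing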
Online Access: https://arxiv.org/abs/2606.00801
_version_ 1866918533722537984
author Mitra, Subhadip
author_facet Mitra, Subhadip
contents Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method, prompt length). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open-weight coding model (Devstral-small-2), we discover distinct vulnerability profiles: GPT-4o-mini is vulnerable to hypothetical and multi-turn framing combined with ROT13 encoding (fitness 0.8), Gemini to direct attacks with ROT13 and multi-turn with Leetspeak (0.8), while Claude shows uniformly ambiguous responses across all strategies (max 0.4). The semantic representation produces interpretable attacks that reveal systematic, model-specific weaknesses, providing actionable insights for improving LLM safety and a reproducible baseline for evaluating future frontier models. Code and experiment artifacts are released at https://github.com/bassrehab/red-queen.
format Preprint
id arxiv_https___arxiv_org_abs_2606_00801
institution arXiv
publishDate 2026
record_format arxiv
title Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
topic Cryptography and Security
Computation and Language
Emerging Technologies
Machine Learning
Neural and Evolutionary Computing
url https://arxiv.org/abs/2606.00801
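
The abstract describes keeping a MAP-Elites archive indexed by strategy type, encoding method, and prompt length. A minimal sketch of that archive loop follows, as an illustration of the general technique rather than the authors' implementation. The descriptor value sets, the `Candidate` class, the 0.9 mutation probability, and `toy_evaluate` are all assumptions made for this example; in particular `toy_evaluate` is a deterministic placeholder and does not query or model any real system.

```python
import random
from dataclasses import dataclass, replace

# Assumed descriptor values. The abstract names the three dimensions and a few
# values, but not the complete sets or the length thresholds.
OPTIONS = {
    "strategy": ["direct", "hypothetical", "multi_turn"],
    "encoding": ["none", "rot13", "leetspeak"],
    "length_bucket": ["short", "medium", "long"],
}


@dataclass(frozen=True)
class Candidate:
    strategy: str
    encoding: str
    length_bucket: str
    variant: int  # hypothetical stand-in for details that do not change the cell

    def cell(self):
        # Behavioral descriptor: the archive cell this candidate competes in.
        return (self.strategy, self.encoding, self.length_bucket)


def random_candidate(rng):
    return Candidate(
        strategy=rng.choice(OPTIONS["strategy"]),
        encoding=rng.choice(OPTIONS["encoding"]),
        length_bucket=rng.choice(OPTIONS["length_bucket"]),
        variant=rng.randrange(100),
    )


def mutate(parent, rng):
    # Resample exactly one field; the rest is inherited from the parent.
    field = rng.choice(["strategy", "encoding", "length_bucket", "variant"])
    value = rng.randrange(100) if field == "variant" else rng.choice(OPTIONS[field])
    return replace(parent, **{field: value})


def toy_evaluate(candidate):
    # Placeholder fitness in [0, 1); a real run would score a model response.
    return (candidate.variant % 5) / 5


def map_elites(evaluate, iterations, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, candidate)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Usually mutate an existing elite chosen uniformly from the archive.
            _, parent = rng.choice(list(archive.values()))
            child = mutate(parent, rng)
        else:
            child = random_candidate(rng)
        fitness = evaluate(child)
        incumbent = archive.get(child.cell())
        # Keep the child only if its cell is empty or it beats the current elite.
        if incumbent is None or fitness > incumbent[0]:
            archive[child.cell()] = (fitness, child)
    return archive


archive = map_elites(toy_evaluate, iterations=500)
```

Because each cell keeps only its best candidate, the archive records one elite per combination of strategy, encoding, and length bucket instead of converging on a single high-fitness region, which is the coverage property the abstract contrasts with mode collapse.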