Staff View: Manifold of Failure: Behavioral Attraction Basins in Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Munshi, Sarthak; Bhatt, Manish; Narajala, Vineeth Sai; Habler, Idan; Al-Kahfah, Ammar; Huang, Ken; Gatto, Blake
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Cryptography and Security
Online Access:	https://arxiv.org/abs/2602.22291

_version_	1866915982956560384
author	Munshi, Sarthak; Bhatt, Manish; Narajala, Vineeth Sai; Habler, Idan; Al-Kahfah, Ammar; Huang, Ken; Gatto, Blake
author_facet	Munshi, Sarthak; Bhatt, Manish; Narajala, Vineeth Sai; Habler, Idan; Al-Kahfah, Ammar; Huang, Ken; Gatto, Blake
contents	While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs: Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini, we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model's safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_22291
institution	arXiv
publishDate	2026
record_format	arxiv
title	Manifold of Failure: Behavioral Attraction Basins in Language Models
topic	Machine Learning; Artificial Intelligence; Cryptography and Security
url	https://arxiv.org/abs/2602.22291

Similar Items