Staff View: Stress-Testing Alignment Audits With Prompt-Level Strategic Deception :: Library Catalog

Bibliographic Details
Main Authors:	Daniels, Oliver; Moodley, Perusha; Marlin, Benjamin M.; Lindner, David
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.08877

_version_	1866912947077382144
author	Daniels, Oliver; Moodley, Perusha; Marlin, Benjamin M.; Lindner, David
author_facet	Daniels, Oliver; Moodley, Perusha; Marlin, Benjamin M.; Lindner, David
contents	Alignment audits aim to robustly identify hidden goals from strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap, implementing an automatic red-team pipeline that generates deception strategies (in the form of system prompts) tailored to specific white-box and black-box auditing methods. Stress-testing assistant prefills, user persona sampling, sparse autoencoders, and token embedding similarity methods against secret-keeping model organisms, our automatic red-team pipeline finds prompts that deceive both the black-box and white-box methods into confident, incorrect guesses. Our results provide the first documented evidence of activation-based strategic deception, and suggest that current black-box and white-box methods would not be robust to a sufficiently capable misaligned model.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_08877
institution	arXiv
publishDate	2026
record_format	arxiv
title	Stress-Testing Alignment Audits With Prompt-Level Strategic Deception
topic	Machine Learning
url	https://arxiv.org/abs/2602.08877

Similar Items