Bibliographic Details
Main Authors: Daniels, Oliver; Moodley, Perusha; Marlin, Benjamin M.; Lindner, David
Format: Preprint
Published: 2026
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2602.08877
_version_ 1866912947077382144
author Daniels, Oliver
Moodley, Perusha
Marlin, Benjamin M.
Lindner, David
contents Alignment audits aim to robustly identify hidden goals from strategic, situationally aware misaligned models. Despite this threat model, existing auditing methods have not been systematically stress-tested against deception strategies. We address this gap, implementing an automatic red-team pipeline that generates deception strategies (in the form of system prompts) tailored to specific white-box and black-box auditing methods. Stress-testing assistant prefills, user persona sampling, sparse autoencoders, and token embedding similarity methods against secret-keeping model organisms, our automatic red-team pipeline finds prompts that deceive both the black-box and white-box methods into confident, incorrect guesses. Our results provide the first documented evidence of activation-based strategic deception, and suggest that current black-box and white-box methods would not be robust to a sufficiently capable misaligned model.
format Preprint
id arxiv_https___arxiv_org_abs_2602_08877
institution arXiv
publishDate 2026
record_format arxiv
title Stress-Testing Alignment Audits With Prompt-Level Strategic Deception
topic Machine Learning
url https://arxiv.org/abs/2602.08877