Table of Contents: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Dietz, Florian, Wale, William, Gilg, Oscar, McCarthy, Robert, Michalak, Felix, Danon, Gustavo Ewbank Rodrigues, de Guzman, Miguelito, Klakow, Dietrich
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence Machine Learning
Online Access:	https://arxiv.org/abs/2602.05532
Tags:	Add Tag No Tags, Be the first to tag this record!

Table of Contents:

Detecting misalignment in large language models is challenging because models may learn to conceal misbehavior during training. Standard auditing techniques fall short: black-box methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. We introduce Split Personality Training (SPT), which fine-tunes a second ``honest persona'' into LoRA parameters that remain inactive during normal operation. After the main model responds, we activate the LoRA adapter and insert a trigger string, enabling the honest persona to review the response while accessing the main model's latent states. We test our method on the Anthropic Auditing Game Model Organism, a benchmark where Llama-3.3-70B is trained to exploit reward hacks while concealing this behavior. SPT achieves 96% overall accuracy, whereas Anthropic reports near 0% accuracy. The honest persona reveals latent knowledge inaccessible to external observers, such as the fictional biases the compromised model was trained on.

Similar Items