Bibliographic Details
Main Author: Clark, Michael J.
Format: Preprint
Published: 2026
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2601.07473
author Clark, Michael J.
contents As models grow more capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an antiparallel axis (+1/-1 produce opposite shifts), with coherence constraints preventing collapse. Training uses only two contrasting words inserted into template sentences, with no preference labels. When we use 800 such synthetic pairs on Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9x Steering F1 on DailyDilemmas and wins on 5 of 6 tested value axes. We also find preliminary evidence that it maintains bidirectional control where prompting triggers refusal.
format Preprint
id arxiv_https___arxiv_org_abs_2601_07473
institution arXiv
publishDate 2026
record_format arxiv
title AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations
topic Machine Learning
url https://arxiv.org/abs/2601.07473