Staff View: AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations :: Library Catalog

Bibliographic Details
Main Author:	Clark, Michael J.
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2601.07473

_version_	1866911672595120128
author	Clark, Michael J.
author_facet	Clark, Michael J.
contents	As models grow more capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an antiparallel axis (+1/-1 produce opposite shifts), with coherence constraints preventing collapse. Training uses only two contrasting words inserted into template sentences, with no preference labels. When we use 800 such synthetic pairs on Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9x Steering F1 on DailyDilemmas and wins on 5 of 6 tested value axes. We also find preliminary evidence that it maintains bidirectional control where prompting triggers refusal.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_07473
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations Clark, Michael J. Machine Learning As models grow more capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an antiparallel axis (+1/-1 produce opposite shifts), with coherence constraints preventing collapse. Training uses only two contrasting words inserted into template sentences, with no preference labels. When we use 800 such synthetic pairs on Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9x Steering F1 on DailyDilemmas and wins on 5 of 6 tested value axes. We also find preliminary evidence that it maintains bidirectional control where prompting triggers refusal.
title	AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations
topic	Machine Learning
url	https://arxiv.org/abs/2601.07473

Similar Items