Bibliographic Details
Main Authors: Ramnauth, Rebecca; Scassellati, Brian
Format: Preprint
Published: 2026
Subjects: Computation and Language; Artificial Intelligence
Online Access: https://arxiv.org/abs/2605.28639
_version_ 1866911725104660480
author Ramnauth, Rebecca
Scassellati, Brian
author_facet Ramnauth, Rebecca
Scassellati, Brian
contents Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.
format Preprint
id arxiv_https___arxiv_org_abs_2605_28639
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle The Attentional White Bear Effect in Transformer Language Models
Ramnauth, Rebecca
Scassellati, Brian
Computation and Language
Artificial Intelligence
title The Attentional White Bear Effect in Transformer Language Models
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2605.28639