Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Wieciech, Bartosz, Awrahman, Zmnako, Czelej, Marcin, Velasquez, Victor Hugo Jaramillo, Stobieniecka, Wioletta
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2605.28149
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866911723757240320
author	Wieciech, Bartosz Awrahman, Zmnako Czelej, Marcin Velasquez, Victor Hugo Jaramillo Stobieniecka, Wioletta
author_facet	Wieciech, Bartosz Awrahman, Zmnako Czelej, Marcin Velasquez, Victor Hugo Jaramillo Stobieniecka, Wioletta
contents	Sparse Autoencoders (SAEs) extract interpretable features from Large Language Models, but standard variants enforce non-negativity, forcing separate latents for diametrically opposed concepts (e.g., "pressure too high" vs. "pressure too low") and wasting dictionary capacity when features are anticorrelated. We propose the Sign-Aware Gated SAE (SA-GSAE): two-sided gated sparsity with signed magnitude and auxiliary supervision. A polarity-sensitive gate selects support on either sign, a signed-magnitude path avoids L1 shrinkage, and an auxiliary reconstruction prevents gate collapse. Bipolar sharing - one latent encoding both signs along a shared direction - is realised via a new Bi-Jump-ReLU activation; parameter accounting shows sign-awareness stays parameter-efficient even when anticorrelated pairs are rare. On real LLM activations across three mid-depth hookpoints on Pythia-1B and SmolLM3-3B (6 cells, 3 seeds), a half-width SA-GSAE at width H strictly Pareto-dominates a full-width Gated SAE at 2H over the entire swept L0 overlap on 3 of 6 cells (both MLP-output hookpoints and resid-mid/Pythia-1B); on the remaining 3 it matches R^2 within 0.025 (max gap -0.008) while cutting dead fraction by 0.35-0.62 absolute. Sweep-geomean dead-fraction reductions are ~100x-500x on MLP-output cells and Pythia-1B resid, ~2x-4x on attention cells and SmolLM3-3B resid. Ablations show the two-sided gate and auxiliary loss are load-bearing (no auxiliary collapses LR to 0.27, 98% dead); tying r_i^+ = r_i^- is indistinguishable (\|Delta R^2\| = 0.0015), and we recommend this symmetric variant as default. MLP-output gains come from most latents carrying both polarities; on attention, bipolar structure concentrates in a small set of top latents. Full-width SA-GSAE exhibits a reproducible reconstruction collapse at SmolLM3-3B resid that the half-width entirely avoids.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_28149
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations Wieciech, Bartosz Awrahman, Zmnako Czelej, Marcin Velasquez, Victor Hugo Jaramillo Stobieniecka, Wioletta Machine Learning Sparse Autoencoders (SAEs) extract interpretable features from Large Language Models, but standard variants enforce non-negativity, forcing separate latents for diametrically opposed concepts (e.g., "pressure too high" vs. "pressure too low") and wasting dictionary capacity when features are anticorrelated. We propose the Sign-Aware Gated SAE (SA-GSAE): two-sided gated sparsity with signed magnitude and auxiliary supervision. A polarity-sensitive gate selects support on either sign, a signed-magnitude path avoids L1 shrinkage, and an auxiliary reconstruction prevents gate collapse. Bipolar sharing - one latent encoding both signs along a shared direction - is realised via a new Bi-Jump-ReLU activation; parameter accounting shows sign-awareness stays parameter-efficient even when anticorrelated pairs are rare. On real LLM activations across three mid-depth hookpoints on Pythia-1B and SmolLM3-3B (6 cells, 3 seeds), a half-width SA-GSAE at width H strictly Pareto-dominates a full-width Gated SAE at 2H over the entire swept L0 overlap on 3 of 6 cells (both MLP-output hookpoints and resid-mid/Pythia-1B); on the remaining 3 it matches R^2 within 0.025 (max gap -0.008) while cutting dead fraction by 0.35-0.62 absolute. Sweep-geomean dead-fraction reductions are ~100x-500x on MLP-output cells and Pythia-1B resid, ~2x-4x on attention cells and SmolLM3-3B resid. Ablations show the two-sided gate and auxiliary loss are load-bearing (no auxiliary collapses LR to 0.27, 98% dead); tying r_i^+ = r_i^- is indistinguishable (\|Delta R^2\| = 0.0015), and we recommend this symmetric variant as default. MLP-output gains come from most latents carrying both polarities; on attention, bipolar structure concentrates in a small set of top latents. Full-width SA-GSAE exhibits a reproducible reconstruction collapse at SmolLM3-3B resid that the half-width entirely avoids.
title	Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations
topic	Machine Learning
url	https://arxiv.org/abs/2605.28149

Similar Items