Staff View: Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small :: Library Catalog

Bibliographic Details
Main Authors:	Chaudhary, Maheep; Geiger, Atticus
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence; Neural and Evolutionary Computing
Online Access:	https://arxiv.org/abs/2409.04478

_version_	1866917770729357312
author	Chaudhary, Maheep; Geiger, Atticus
author_facet	Chaudhary, Maheep; Geiger, Atticus
contents	A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis. However, the body of evidence on whether SAE feature spaces are useful for causal analysis is underdeveloped. In this work, we use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We evaluate four open-source SAEs for GPT-2 small against each other, with neurons serving as a baseline, and linear features learned via distributed alignment search (DAS) serving as a skyline. For each, we learn a binary mask to select features that will be patched to change the country of a city without changing the continent, or vice versa. Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline. We release code here: https://github.com/MaheepChaudhary/SAE-Ravel
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_04478
institution	arXiv
publishDate	2024
record_format	arxiv
title	Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
topic	Machine Learning; Artificial Intelligence; Neural and Evolutionary Computing
url	https://arxiv.org/abs/2409.04478
