Staff View: Automated Feature Labeling with Token-Space Gradient Descent :: Library Catalog

Bibliographic Details
Main Authors:	Schulz, Julian; Fallows, Seamus
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2504.00754

_version_	1866913770216882176
author	Schulz, Julian; Fallows, Seamus
author_facet	Schulz, Julian; Fallows, Seamus
contents	We present a novel approach to feature labeling using gradient descent in token-space. While existing methods typically use language models to generate hypotheses about feature meanings, our method directly optimizes label representations by using a language model as a discriminator to predict feature activations. We formulate this as a multi-objective optimization problem in token-space, balancing prediction accuracy, entropy minimization, and linguistic naturalness. Our proof-of-concept experiments demonstrate successful convergence to interpretable single-token labels across diverse domains, including features for detecting animals, mammals, Chinese text, and numbers. Although our current implementation is constrained to single-token labels and relatively simple features, the results suggest that token-space gradient descent could become a valuable addition to the interpretability researcher's toolkit.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_00754
institution	arXiv
publishDate	2025
record_format	arxiv
title	Automated Feature Labeling with Token-Space Gradient Descent
topic	Machine Learning
url	https://arxiv.org/abs/2504.00754

Similar Items