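Staff View: Flash Interpretability: Decoding Specialised Feature Neurons in Large Language Models with the LM-Head :: Library Catalog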

Bibliographic Details
Main Author:	Davies, Harry J
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2501.02688

_version_	1866912251451015168
author	Davies, Harry J
author_facet	Davies, Harry J
contents	Large Language Models (LLMs) typically have billions of parameters and are thus often difficult to interpret in their operation. In this work, we demonstrate that it is possible to decode neuron weights directly into token probabilities through the final projection layer of the model (the LM-head). This is illustrated in Llama 3.1 8B, where we use the LM-head to find examples of specialised feature neurons such as a "dog" neuron and a "California" neuron, and we validate this by clamping these neurons to affect the probability of the concept in the output. We evaluate this method on both the pretrained and Instruct models, finding that over 75% of neurons in the up-projection layers of the Instruct model have the same top associated token as in the pretrained model. Finally, we demonstrate that clamping the "dog" neuron leads the Instruct model to always discuss dogs when asked about its favourite animal. Through our method, it is possible to map the top features of the entirety of Llama 3.1 8B's up-projection neurons in less than 10 seconds, with minimal compute.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_02688
institution	arXiv
publishDate	2025
record_format	arxiv
title	Flash Interpretability: Decoding Specialised Feature Neurons in Large Language Models with the LM-Head
topic	Computation and Language
url	https://arxiv.org/abs/2501.02688
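
Illustrative note on the contents field: the abstract describes decoding a neuron's weight vector into token probabilities by passing it through the LM-head. The following is a minimal toy sketch of that idea only, not the preprint's code. The names `decode_neuron`, `unit`, `HIDDEN`, `VOCAB`, and `PLANTED` are hypothetical, the matrices are random stand-ins rather than Llama 3.1 8B weights, and the "feature neuron" is planted by hand so that its expected top token is known in advance.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decode_neuron(neuron, lm_head):
    """Project one neuron weight vector through the LM-head rows.

    lm_head holds one row per vocabulary token, each of length HIDDEN.
    The result is a probability for every token in the vocabulary.
    """
    logits = [sum(w * h for w, h in zip(neuron, row)) for row in lm_head]
    return softmax(logits)


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


random.seed(0)
HIDDEN, VOCAB = 16, 50  # toy sizes, assumed for illustration only

# Random unit-norm stand-in for the LM-head (assumption: not real weights).
lm_head = [unit([random.gauss(0, 1) for _ in range(HIDDEN)]) for _ in range(VOCAB)]

# Plant a toy "feature neuron" aligned with one token's LM-head row.
PLANTED = 7
neuron = [5.0 * x for x in lm_head[PLANTED]]

probs = decode_neuron(neuron, lm_head)
top_token = max(range(VOCAB), key=lambda t: probs[t])
print(top_token, round(probs[top_token], 3))
```

Because the LM-head rows are unit length, the planted neuron's largest dot product is with its own row, so the decoded top token is the planted one. On a real model the same projection would be applied to every neuron's weight vector at once as a single matrix product, which is presumably what makes the mapping fast, though the exact procedure is only given in the preprint itself.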

Similar Items