Staff View: Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI :: Library Catalog

Bibliographic Details
Main Author:	Kapelko, Eduard
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2509.25220

_version_	1866908566733979648
author	Kapelko, Eduard
author_facet	Kapelko, Eduard
contents	Safety and controllability are critical for large language models. A central question is whether undesirable behaviors like deception are localized functions that can be removed, or if they are deeply intertwined with a model's core cognitive abilities. We introduce "cyclic ablation," an iterative method to test this. By combining sparse autoencoders, targeted ablation, and adversarial training on DistilGPT-2, we attempted to eliminate the concept of deception. We found that, contrary to the localization hypothesis, deception was highly resilient. The model consistently recovered its deceptive behavior after each ablation cycle via adversarial training, a process we term functional regeneration. Crucially, every attempt at this "neurosurgery" caused a gradual but measurable decay in general linguistic performance, reflected by a consistent rise in perplexity. These findings are consistent with the view that complex concepts are distributed and entangled, underscoring the limitations of direct model editing through mechanistic interpretability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_25220
institution	arXiv
publishDate	2025
record_format	arxiv
title	Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2509.25220

Similar Items