Bibliographic Details
Main Author: Kapelko, Eduard
Format: Preprint
Published: 2025
Subjects: Computation and Language; Machine Learning
Online Access: https://arxiv.org/abs/2509.25220
author Kapelko, Eduard
author_facet Kapelko, Eduard
contents Safety and controllability are critical for large language models. A central question is whether undesirable behaviors like deception are localized functions that can be removed, or if they are deeply intertwined with a model's core cognitive abilities. We introduce "cyclic ablation," an iterative method to test this. By combining sparse autoencoders, targeted ablation, and adversarial training on DistilGPT-2, we attempted to eliminate the concept of deception. We found that, contrary to the localization hypothesis, deception was highly resilient. The model consistently recovered its deceptive behavior after each ablation cycle via adversarial training, a process we term functional regeneration. Crucially, every attempt at this "neurosurgery" caused a gradual but measurable decay in general linguistic performance, reflected by a consistent rise in perplexity. These findings are consistent with the view that complex concepts are distributed and entangled, underscoring the limitations of direct model editing through mechanistic interpretability.
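The abstract describes a loop (locate features, ablate them, retrain adversarially, measure perplexity) without giving implementation details. The following is only a minimal sketch of that loop's shape under stated assumptions, not the author's code: the names `perplexity`, `ablate`, and `cyclic_ablation` and all four callback parameters are hypothetical, and the ordering of steps within a cycle is inferred from the abstract alone.

```python
import math
from typing import Callable, List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity as exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def ablate(activations: Sequence[float], feature_ids: Sequence[int]) -> List[float]:
    """Return a copy of the activations with the selected features set to zero."""
    targets = set(feature_ids)
    return [0.0 if i in targets else a for i, a in enumerate(activations)]


def cyclic_ablation(
    cycles: int,
    find_features: Callable[[], Sequence[int]],            # hypothetical: e.g. SAE-based feature search
    ablate_features: Callable[[Sequence[int]], None],      # hypothetical: removes those features from the model
    adversarial_train: Callable[[], None],                 # hypothetical: adversarial training step
    measure_perplexity: Callable[[], float],               # hypothetical: evaluation on held-out text
) -> List[float]:
    """Run the assumed locate -> ablate -> retrain -> measure cycle and record perplexity."""
    history: List[float] = []
    for _ in range(cycles):
        features = find_features()
        ablate_features(features)
        adversarial_train()
        history.append(measure_perplexity())
    return history
```

In this sketch, a rising sequence returned by `cyclic_ablation` would correspond to the perplexity increase the abstract reports; the callbacks would need to be bound to a real model and sparse autoencoder, which the record does not specify.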
format Preprint
id arxiv_https___arxiv_org_abs_2509_25220
institution arXiv
publishDate 2025
record_format arxiv
title Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2509.25220