Bibliographic Details
Main Authors: Kodama, Shingo; Cohen, Niv; Adler, Micah; Shavit, Nir
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.19215
author Kodama, Shingo
Cohen, Niv
Adler, Micah
Shavit, Nir
contents While many recent methods aim to unlearn or remove knowledge from pretrained models, seemingly erased knowledge often persists and can be recovered in various ways. Because large foundation models are far from interpretable, understanding whether and how such knowledge persists remains a significant challenge. To address this, we turn to the recently developed framework of combinatorial interpretability. This framework, designed for two-layer neural networks, enables direct inspection of the knowledge encoded in the model weights. We reproduce baseline unlearning methods within the combinatorial interpretability setting and examine their behavior along two dimensions: (i) whether they truly remove knowledge of a target concept (the concept we wish to remove) or merely inhibit its expression while retaining the underlying information, and (ii) how easily the supposedly erased knowledge can be recovered through various fine-tuning operations. Our results shed light within a fully interpretable setting on how knowledge can persist despite unlearning and when it might resurface.
format Preprint
id arxiv_https___arxiv_org_abs_2602_19215
institution arXiv
publishDate 2026
record_format arxiv
title Understanding Empirical Unlearning with Combinatorial Interpretability
topic Machine Learning
url https://arxiv.org/abs/2602.19215