Staff View: Forget Me Not: Fighting Local Overfitting with Knowledge Fusion and Distillation :: Library Catalog

Bibliographic Details
Main Authors:	Stern, Uri; Corn, Eli; Weinshall, Daphna
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2507.08686

_version_	1866911051482660864
author	Stern, Uri; Corn, Eli; Weinshall, Daphna
author_facet	Stern, Uri; Corn, Eli; Weinshall, Daphna
contents	Overfitting in deep neural networks occurs less frequently than expected. This is a puzzling observation, as theory predicts that greater model capacity should eventually lead to overfitting -- yet this is rarely seen in practice. But what if overfitting does occur, not globally, but in specific sub-regions of the data space? In this work, we introduce a novel score that measures the forgetting rate of deep models on validation data, capturing what we term local overfitting: a performance degradation confined to certain regions of the input space. We demonstrate that local overfitting can arise even without conventional overfitting, and is closely linked to the double descent phenomenon. Building on these insights, we introduce a two-stage approach that leverages the training history of a single model to recover and retain forgotten knowledge: first, by aggregating checkpoints into an ensemble, and then by distilling it into a single model of the original size, thus enhancing performance without added inference cost. Extensive experiments across multiple datasets, modern architectures, and training regimes validate the effectiveness of our approach. Notably, in the presence of label noise, our method -- Knowledge Fusion followed by Knowledge Distillation -- outperforms both the original model and independently trained ensembles, achieving a rare win-win scenario: reduced training and inference complexity.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_08686
institution	arXiv
publishDate	2025
record_format	arxiv
title	Forget Me Not: Fighting Local Overfitting with Knowledge Fusion and Distillation
topic	Machine Learning
url	https://arxiv.org/abs/2507.08686
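
Illustrative note on the contents field: the abstract describes a score for how much a model forgets on validation data, and a first stage that aggregates training checkpoints into an ensemble. The record does not give the paper's definitions, so the following is only a minimal sketch under stated assumptions: "forgotten" is taken to mean a validation example that some earlier checkpoint classified correctly and the final checkpoint gets wrong, and the ensemble is a uniform average of class probabilities. The names `forget_fraction` and `average_probabilities` are hypothetical, and the distillation stage is not shown.

```python
from typing import List, Sequence


def forget_fraction(correct_by_checkpoint: Sequence[Sequence[bool]]) -> float:
    """Fraction of validation examples that at least one earlier checkpoint
    classified correctly but the final checkpoint misclassifies.

    Assumption: this is one plausible reading of a "forgetting" score,
    not the definition used in the paper.
    correct_by_checkpoint[t][i] is True if checkpoint t is correct on example i.
    """
    if len(correct_by_checkpoint) < 2:
        raise ValueError("need at least one earlier checkpoint and a final one")
    *earlier, final = correct_by_checkpoint
    n = len(final)
    if n == 0:
        raise ValueError("no validation examples")
    forgotten = 0
    for i in range(n):
        was_correct_before = any(ckpt[i] for ckpt in earlier)
        if was_correct_before and not final[i]:
            forgotten += 1
    return forgotten / n


def average_probabilities(
    probs_by_checkpoint: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Uniformly average class probabilities across checkpoints.

    Assumption: a plain average stands in for the checkpoint ensemble;
    the paper may select or weight checkpoints differently.
    probs_by_checkpoint[t][i][c] is checkpoint t's probability of class c
    for example i.
    """
    k = len(probs_by_checkpoint)
    n = len(probs_by_checkpoint[0])
    c = len(probs_by_checkpoint[0][0])
    return [
        [sum(p[i][j] for p in probs_by_checkpoint) / k for j in range(c)]
        for i in range(n)
    ]
```

With three checkpoints and four validation examples, an example counts as forgotten only if it was correct at some point before the final checkpoint and is wrong at the end; examples that were never correct earlier do not count.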