Bibliographic Details
Main Authors: Tavor, Almog; Ebenspanger, Itay; Cnaan, Neil; Geva, Mor
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2602.01395
author Tavor, Almog
Ebenspanger, Itay
Cnaan, Neil
Geva, Mor
contents Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
format Preprint
id arxiv_https___arxiv_org_abs_2602_01395
institution arXiv
publishDate 2026
record_format arxiv
title Rethinking Selective Knowledge Distillation
topic Computation and Language
url https://arxiv.org/abs/2602.01395