Staff View: Rethinking Selective Knowledge Distillation :: Library Catalog

Bibliographic Details
Main Authors:	Tavor, Almog; Ebenspanger, Itay; Cnaan, Neil; Geva, Mor
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2602.01395

_version_	1866918318199275520
author	Tavor, Almog Ebenspanger, Itay Cnaan, Neil Geva, Mor
author_facet	Tavor, Almog Ebenspanger, Itay Cnaan, Neil Geva, Mor
contents	Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_01395
institution	arXiv
publishDate	2026
record_format	arxiv
title	Rethinking Selective Knowledge Distillation
topic	Computation and Language
url	https://arxiv.org/abs/2602.01395
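
The abstract in the contents field describes student-entropy-guided position selection (SE-KD) only at a high level, so the following is a minimal sketch of the general idea rather than the authors' method. It assumes, without confirmation from the record, that positions are ranked by the entropy of the student's next-token distribution, that the top fraction is kept, and that a forward KL divergence from teacher to student is averaged over the kept positions. The function names and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def select_positions_by_student_entropy(student_logits, keep_fraction=0.25):
    """Return sorted indices of the highest-entropy positions, plus all entropies.

    student_logits has shape (T, V): T token positions, V vocabulary classes.
    The top-k policy and keep_fraction are assumptions made for illustration.
    """
    logp = log_softmax(student_logits)
    entropy = -(np.exp(logp) * logp).sum(axis=-1)  # shape (T,)
    k = max(1, int(round(keep_fraction * entropy.shape[0])))
    top = np.argsort(-entropy)[:k]  # positions where the student is least certain
    return np.sort(top), entropy


def selective_kd_loss(student_logits, teacher_logits, keep_fraction=0.25):
    """Mean KL(teacher || student) over the selected positions only.

    The forward-KL direction is an assumption; the abstract does not state
    which divergence is used.
    """
    idx, _ = select_positions_by_student_entropy(student_logits, keep_fraction)
    s = log_softmax(student_logits[idx])
    t = log_softmax(teacher_logits[idx])
    kl = (np.exp(t) * (t - s)).sum(axis=-1)
    return float(kl.mean()), idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    student = rng.normal(size=(8, 16))  # 8 positions, 16 vocabulary classes
    teacher = rng.normal(size=(8, 16))
    loss, kept = selective_kd_loss(student, teacher, keep_fraction=0.25)
    print("kept positions:", kept, "loss:", loss)
```

Under these assumptions, teacher logits are needed only at the kept positions, which is one plausible reading of why the abstract links selection to memory savings and to offline teacher caching.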

Similar Items