Staff View: SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs :: Library Catalog

Bibliographic Details
Main Authors:	Huang, Haiduo; Song, Jiangcheng; Zhang, Yadong; Ren, Pengju
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.24021

_version_	1866914159705194496
author	Huang, Haiduo; Song, Jiangcheng; Zhang, Yadong; Ren, Pengju
author_facet	Huang, Haiduo; Song, Jiangcheng; Zhang, Yadong; Ren, Pengju
contents	Knowledge distillation (KD) is a standard route to compress Large Language Models (LLMs) into compact students, yet most pipelines uniformly apply token-wise loss regardless of teacher confidence. This indiscriminate supervision amplifies noisy, high-entropy signals and is especially harmful under large teacher-student capacity gaps. We introduce SelecTKD, a plug-and-play Selective Token-Weighted distillation framework that shifts the focus from "how to measure divergence" to "where to apply learning". At each step, the student proposes tokens that are verified by the teacher through a robust propose-and-verify procedure with two variants: greedy Top-k and non-greedy Spec-k. Accepted tokens receive full loss, while rejected tokens are masked or down-weighted. This objective-agnostic design works with on- and off-policy data, induces an implicit curriculum quantified by Token Acceptance Rate (TAR), and stabilizes optimization. Across instruction following, mathematical reasoning, code generation, and a VLM setting, SelecTKD consistently improves strong baselines and achieves state-of-the-art results for small models without architectural changes or extra reference models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_24021
institution	arXiv
publishDate	2025
record_format	arxiv
title	SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2510.24021
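
Illustrative note on the "contents" field: the abstract describes a propose-and-verify step in which accepted tokens receive full loss and rejected tokens are masked or down-weighted. The sketch below shows one way the greedy Top-k variant might look. It is not the authors' implementation. The acceptance rule (the student's argmax token must fall within the teacher's top-k), the choice of forward KL as the per-token loss, the normalization by total weight, and all function and parameter names (`teacher_accepts`, `selective_kd_loss`, `reject_weight`) are assumptions made for illustration; the non-greedy Spec-k variant is not covered.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def teacher_accepts(student_logits, teacher_logits, k):
    """Assumed Top-k rule: the student's greedy token must be in the teacher's top-k."""
    vocab = range(len(student_logits))
    proposed = max(vocab, key=student_logits.__getitem__)
    top_k = sorted(vocab, key=teacher_logits.__getitem__, reverse=True)[:k]
    return proposed in top_k


def token_kl(student_logits, teacher_logits):
    """Forward KL(teacher || student) for one position (an assumed base objective)."""
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


def selective_kd_loss(student_seq, teacher_seq, k=1, reject_weight=0.0):
    """Weighted token loss plus Token Acceptance Rate (TAR).

    reject_weight=0.0 masks rejected tokens; a value in (0, 1) down-weights them.
    """
    accepted = [teacher_accepts(s, t, k) for s, t in zip(student_seq, teacher_seq)]
    weights = [1.0 if a else reject_weight for a in accepted]
    losses = [token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    total_weight = sum(weights)
    if total_weight > 0:
        loss = sum(w * l for w, l in zip(weights, losses)) / total_weight
    else:
        loss = 0.0  # every token was rejected and masked
    tar = sum(accepted) / len(accepted)
    return loss, tar, accepted


if __name__ == "__main__":
    # Toy logits: 3 positions, vocabulary of 3 tokens (made-up values).
    student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
    teacher = [[3.0, 1.0, 0.0], [3.0, 1.0, 0.0], [0.0, 1.0, 3.0]]
    for k in (1, 2):
        loss, tar, accepted = selective_kd_loss(student, teacher, k=k)
        print(f"k={k} accepted={accepted} TAR={tar:.2f} loss={loss:.4f}")
```

In this toy setup, raising k accepts more of the student's proposals, so TAR rises. Tracking TAR over training is how the abstract suggests the implicit curriculum can be quantified.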

Similar Items