Staff View: Sinkhorn Distance Minimization for Knowledge Distillation :: Library Catalog

Bibliographic Details
Main Authors:	Cui, Xiao; Qin, Yulei; Gao, Yuting; Zhang, Enwei; Xu, Zihan; Wu, Tong; Li, Ke; Sun, Xing; Zhou, Wengang; Li, Houqiang
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2402.17110

_version_	1866909123630596096
author	Cui, Xiao; Qin, Yulei; Gao, Yuting; Zhang, Enwei; Xu, Zihan; Wu, Tong; Li, Ke; Sun, Xing; Zhou, Wengang; Li, Houqiang
author_facet	Cui, Xiao; Qin, Yulei; Gao, Yuting; Zhang, Enwei; Xu, Zihan; Wu, Tong; Li, Ke; Sun, Xing; Zhou, Wengang; Li, Houqiang
contents	Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures, including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this paper, we show that the KL, RKL, and JS divergences respectively suffer from mode-averaging, mode-collapsing, and mode-underestimation, which degrade logits-based KD for diverse NLP tasks. We propose Sinkhorn Knowledge Distillation (SinKD), which exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions. Moreover, by exploiting properties of the Sinkhorn metric, we can dispense with sample-wise KD, which restricts the perception of divergence to each teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture the geometric intricacies of distributions across samples in the high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights the superiority of our method over state-of-the-art methods on LLMs with encoder-only, encoder-decoder, and decoder-only architectures.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_17110
institution	arXiv
publishDate	2024
record_format	arxiv
title	Sinkhorn Distance Minimization for Knowledge Distillation
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2402.17110
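
The abstract's central claim is that KL-type divergences give no useful signal when teacher and student distributions barely overlap, whereas the Sinkhorn distance still does. The following is a minimal NumPy sketch of that contrast on a toy example. It is not the authors' SinKD implementation: the function names, the ground cost |i - j| between class indices, the regularisation strength `eps`, and the iteration count are all assumptions chosen for illustration, and the batch-wise reformulation described in the abstract is not shown.

```python
import numpy as np


def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=200):
    """Transport cost of the entropy-regularised plan between histograms p and q.

    Illustrative sketch only; `eps` and `n_iters` are assumed values.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    kernel = np.exp(-cost / eps)          # Gibbs kernel of the ground cost
    u = np.ones_like(p)
    for _ in range(n_iters):              # Sinkhorn-Knopp scaling iterations
        v = q / (kernel.T @ u)            # match column marginals to q
        u = p / (kernel @ v)              # match row marginals to p
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost))


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; infinite where q lacks support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


# Toy setup (assumed): three classes on a line, ground cost |i - j|.
idx = np.arange(3)
cost = np.abs(idx[:, None] - idx[None, :]).astype(float)

teacher = [1.0, 0.0, 0.0]
student_near = [0.0, 1.0, 0.0]   # disjoint support, one class away
student_far = [0.0, 0.0, 1.0]    # disjoint support, two classes away

kl_near = kl_divergence(teacher, student_near)        # inf
kl_far = kl_divergence(teacher, student_far)          # inf
sk_near = sinkhorn_distance(teacher, student_near, cost)  # approx. 1.0
sk_far = sinkhorn_distance(teacher, student_far, cost)    # approx. 2.0
```

With disjoint supports, KL is infinite for both students and cannot rank them, while the Sinkhorn value grows with how far the probability mass has to move, so the nearer student scores lower.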

Similar Items