Bibliographic Details
Main Authors: Cui, Xiao, Qin, Yulei, Gao, Yuting, Zhang, Enwei, Xu, Zihan, Wu, Tong, Li, Ke, Sun, Xing, Zhou, Wengang, Li, Houqiang
Format: Preprint
Published: 2024
Subjects: Machine Learning; Computation and Language
Online Access: https://arxiv.org/abs/2402.17110
_version_ 1866909123630596096
author Cui, Xiao
Qin, Yulei
Gao, Yuting
Zhang, Enwei
Xu, Zihan
Wu, Tong
Li, Ke
Sun, Xing
Zhou, Wengang
Li, Houqiang
contents Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures, including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this paper, we show that the KL, RKL, and JS divergences suffer from mode-averaging, mode-collapsing, and mode-underestimation, respectively, which degrades logits-based KD for diverse NLP tasks. We propose Sinkhorn Knowledge Distillation (SinKD), which exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions. Moreover, thanks to the properties of the Sinkhorn metric, we can dispense with sample-wise KD, which restricts the perception of divergence to each teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture the geometric intricacies of distributions across samples in the high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights the superiority of SinKD over state-of-the-art methods on LLMs with encoder-only, encoder-decoder, and decoder-only architectures.
format Preprint
id arxiv_https___arxiv_org_abs_2402_17110
institution arXiv
publishDate 2024
record_format arxiv
title Sinkhorn Distance Minimization for Knowledge Distillation
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2402.17110
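
The abstract's claim that overlap-based divergences give weak supervision when teacher and student distributions barely overlap can be illustrated with a toy example. The sketch below is not the SinKD implementation: the ground cost `|i - j|` between class indices, the regularization value `reg`, the iteration count, and all function names are assumptions made for illustration, and it compares single distributions rather than using the batch-wise formulation the abstract describes. It uses plain (not log-domain) Sinkhorn iterations, so it is only suitable for small, well-scaled inputs.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sinkhorn_distance(p, q, cost, reg=0.1, n_iters=200):
    """Entropy-regularized transport cost <P, C> via basic Sinkhorn scaling.

    `reg` and `n_iters` are illustrative defaults, not values from the paper.
    """
    n, m = len(p), len(q)
    # Gibbs kernel K = exp(-C / reg).
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals p and q.
        u = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j; return its total cost.
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

# Hypothetical 5-class example with a hypothetical ground cost |i - j|.
num_classes = 5
cost = [[abs(i - j) for j in range(num_classes)] for i in range(num_classes)]

teacher = [1.0, 0.0, 0.0, 0.0, 0.0]       # all mass on class 0
student_near = [0.0, 1.0, 0.0, 0.0, 0.0]  # all mass on class 1
student_far = [0.0, 0.0, 0.0, 0.0, 1.0]   # all mass on class 4

# JS saturates at log(2), about 0.693, for both students: no overlap either way.
print(js_divergence(teacher, student_near), js_divergence(teacher, student_far))

# The Sinkhorn cost still separates the two: about 1.0 versus about 4.0.
print(sinkhorn_distance(teacher, student_near, cost),
      sinkhorn_distance(teacher, student_far, cost))
```

With disjoint supports, JS returns the same value for the near and the far student, whereas the transport-based cost grows with how far the mass has to move under the chosen ground cost. That is the kind of signal the abstract attributes to the Sinkhorn distance.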