Staff View: Exploring Learning Complexity for Efficient Downstream Dataset Pruning :: Library Catalog

Bibliographic Details
Main Authors:	Jiang, Wenyu; Liu, Zhenlong; Xie, Zejian; Zhang, Songxin; Jing, Bingyi; Wei, Hongxin
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2402.05356

_version_	1866916725311668224
author	Jiang, Wenyu; Liu, Zhenlong; Xie, Zejian; Zhang, Songxin; Jing, Bingyi; Wei, Hongxin
author_facet	Jiang, Wenyu; Liu, Zhenlong; Xie, Zejian; Zhang, Songxin; Jing, Bingyi; Wei, Hongxin
contents	The ever-increasing fine-tuning cost of large-scale pre-trained models gives rise to the importance of dataset pruning, which aims to reduce dataset size while maintaining task performance. However, existing dataset pruning methods require training on the entire dataset, which is impractical for large-scale pre-trained models. In this paper, we propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC), to identify informative images and instructions from the downstream dataset efficiently. Our method is motivated by the observation that easy samples learned faster can also be learned with fewer parameters. Specifically, we define the Learning Complexity to quantify sample hardness and utilize a lightweight weights masking process for fast estimation, instead of the costly SGD optimization. Based on DLC, we further design a flexible under-sampling with randomness (dubbed FlexRand), replacing the top-K strategy, to alleviate the severe subset distribution shift. Extensive experiments with downstream image and instructions dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. In the images pruning benchmark, DLC significantly reduces the pruning time by 35x while establishing state-of-the-art performance with FlexRand.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_05356
institution	arXiv
publishDate	2024
record_format	arxiv
title	Exploring Learning Complexity for Efficient Downstream Dataset Pruning
topic	Machine Learning
url	https://arxiv.org/abs/2402.05356

Similar Items