Bibliographic Details
Main Authors: Jiang, Wenyu, Liu, Zhenlong, Xie, Zejian, Zhang, Songxin, Jing, Bingyi, Wei, Hongxin
Format: Preprint
Published: 2024
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2402.05356
author Jiang, Wenyu
Liu, Zhenlong
Xie, Zejian
Zhang, Songxin
Jing, Bingyi
Wei, Hongxin
contents The ever-increasing cost of fine-tuning large-scale pre-trained models makes dataset pruning increasingly important: the goal is to reduce dataset size while maintaining task performance. However, existing dataset pruning methods require training on the entire dataset, which is impractical for large-scale pre-trained models. In this paper, we propose a straightforward, novel, and training-free hardness score, Distorting-based Learning Complexity (DLC), to efficiently identify informative images and instructions in the downstream dataset. Our method is motivated by the observation that easy samples, which are learned faster, can also be learned with fewer parameters. Specifically, we define Learning Complexity to quantify sample hardness and estimate it quickly with a lightweight weight-masking process instead of costly SGD optimization. Based on DLC, we further design a flexible under-sampling strategy with randomness (dubbed FlexRand) that replaces the top-K strategy and alleviates the severe distribution shift of the selected subset. Extensive experiments on downstream image and instruction dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. On the image pruning benchmark, DLC reduces pruning time by 35x while establishing state-of-the-art performance with FlexRand.
format Preprint
id arxiv_https___arxiv_org_abs_2402_05356
institution arXiv
publishDate 2024
record_format arxiv
title Exploring Learning Complexity for Efficient Downstream Dataset Pruning
topic Machine Learning
url https://arxiv.org/abs/2402.05356