Staff View: ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws :: Library Catalog

Bibliographic Details
Main Authors:	Li, Ruihang; Wei, Yixuan; Zhang, Miaosen; Yu, Nenghai; Hu, Han; Peng, Houwen
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2408.08310

_version_	1866914913496072192
author	Li, Ruihang; Wei, Yixuan; Zhang, Miaosen; Yu, Nenghai; Hu, Han; Peng, Houwen
author_facet	Li, Ruihang; Wei, Yixuan; Zhang, Miaosen; Yu, Nenghai; Hu, Han; Peng, Houwen
contents	High-quality data is crucial for the pre-training performance of large language models. Unfortunately, existing quality filtering methods rely on a known high-quality dataset as a reference, which can introduce bias and compromise diversity. In this paper, we propose ScalingFilter, a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data, thereby eliminating the influence of a reference dataset on the filtering process. A theoretical analysis shows that ScalingFilter is equivalent to an inverse utilization of scaling laws. By training models with 1.3B parameters on the same data source processed by various quality filters, we find that ScalingFilter can improve the zero-shot performance of pre-trained models on downstream tasks. To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to obtain semantic representations. Extensive experiments reveal that semantic diversity is a reliable indicator of dataset diversity, and that ScalingFilter achieves an optimal balance between downstream performance and semantic diversity.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_08310
institution	arXiv
publishDate	2024
record_format	arxiv
title	ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws
topic	Computation and Language
url	https://arxiv.org/abs/2408.08310
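
Illustrative note: the abstract describes scoring text by the perplexity difference between two language models trained on the same data. The sketch below shows one way such a score could be computed. It is not taken from the paper. The function names, the use of a log-perplexity gap as the score, the rule that a larger gap ranks higher, and the toy log-probabilities are all assumptions made for illustration; the paper's exact scoring formula may differ.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one document from its natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_gap(small_logprobs, large_logprobs):
    """Hypothetical score: log-perplexity under the smaller model minus
    log-perplexity under the larger model, for the same document."""
    return math.log(perplexity(small_logprobs)) - math.log(perplexity(large_logprobs))

def keep_top_fraction(scores, fraction):
    """Indices of the highest-scoring documents, returned in original order.
    Assumes (for illustration) that a larger gap means a better document."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy per-token log-probabilities; in practice these would come from
# running each document through the two language models.
docs = [
    ([math.log(0.25)] * 4, [math.log(0.5)] * 4),  # large model helps a lot
    ([math.log(0.5)] * 4, [math.log(0.5)] * 4),   # no gain from the larger model
]
scores = [perplexity_gap(small, large) for small, large in docs]
kept = keep_top_fraction(scores, 0.5)
```

Because the score compares two models on the same document, no reference corpus of known high-quality text is needed, which is the property the abstract emphasizes.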

Similar Items