MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Jifan; Luo, Ziyue; Liu, Jia; Shroff, Ness; Nowak, Robert
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2410.02755

_version_	1866912212318158848
author	Zhang, Jifan; Luo, Ziyue; Liu, Jia; Shroff, Ness; Nowak, Robert
author_facet	Zhang, Jifan; Luo, Ziyue; Liu, Jia; Shroff, Ness; Nowak, Robert
contents	Large language models require vast amounts of high-quality training data, but effective filtering of web-scale datasets remains a significant challenge. This paper demonstrates that GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web scale. We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1% of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight text classification models, using active learning to fine-tune these models in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. Through different filtering prompts, SIEVE can efficiently curate high-quality data for general or specialized domains from web-scale corpora -- a valuable capability given the current scarcity of high-quality domain-specific datasets. Extensive experiments using automatic and human evaluation metrics show that SIEVE and GPT-4o achieve similar performance on five highly specific filtering prompts. In addition, when performing quality filtering on web crawl datasets, we demonstrate that SIEVE can further improve over state-of-the-art quality filtering methods in the DataComp-LM challenge for selecting LLM pretraining data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_02755
institution	arXiv
publishDate	2024
record_format	arxiv
title	GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2410.02755
