Enregistré dans:
| Auteurs principaux: | , , , , |
|---|---|
| Format: | Preprint |
| Publié: |
2024
|
| Sujets: | |
| Accès en ligne: | https://arxiv.org/abs/2410.02755 |
| Tags: |
Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
|
| _version_ | 1866912212318158848 |
|---|---|
| author | Zhang, Jifan Luo, Ziyue Liu, Jia Shroff, Ness Nowak, Robert |
| author_facet | Zhang, Jifan Luo, Ziyue Liu, Jia Shroff, Ness Nowak, Robert |
| contents | Large language models require vast amounts of high-quality training data, but effective filtering of web-scale datasets remains a significant challenge. This paper demonstrates that GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web-scale. We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1\% of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight text classification models, using active learning to fine-tune these models in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. Through different filtering prompts, SIEVE can efficiently curate high quality data for general or specialized domains from web-scale corpora -- a valuable capability given the current scarcity of high-quality domain-specific datasets. Extensive experiments using automatic and human evaluation metrics show that SIEVE and GPT-4o achieve similar performance on five highly specific filtering prompts. In addition, when performing quality filtering on web crawl datasets, we demonstrate SIEVE can further improve over state-of-the-art quality filtering methods in the DataComp-LM challenge for selecting LLM pretraining data. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2410_02755 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| spellingShingle | GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data Zhang, Jifan Luo, Ziyue Liu, Jia Shroff, Ness Nowak, Robert Computation and Language Machine Learning Large language models require vast amounts of high-quality training data, but effective filtering of web-scale datasets remains a significant challenge. This paper demonstrates that GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web-scale. We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1\% of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight text classification models, using active learning to fine-tune these models in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. Through different filtering prompts, SIEVE can efficiently curate high quality data for general or specialized domains from web-scale corpora -- a valuable capability given the current scarcity of high-quality domain-specific datasets. Extensive experiments using automatic and human evaluation metrics show that SIEVE and GPT-4o achieve similar performance on five highly specific filtering prompts. In addition, when performing quality filtering on web crawl datasets, we demonstrate SIEVE can further improve over state-of-the-art quality filtering methods in the DataComp-LM challenge for selecting LLM pretraining data. |
| title | GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data |
| topic | Computation and Language Machine Learning |
| url | https://arxiv.org/abs/2410.02755 |