Bibliographic Details
Main Authors: Klimaszewski, Mateusz; Andruszkiewicz, Piotr
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2605.23721
author Klimaszewski, Mateusz
Andruszkiewicz, Piotr
contents Classifier-based Quality Filtering has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model's quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.
format Preprint
id arxiv_https___arxiv_org_abs_2605_23721
institution arXiv
publishDate 2026
record_format arxiv
title Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering
topic Computation and Language
url https://arxiv.org/abs/2605.23721