Staff View: Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality :: Library Catalog

Bibliographic Details
Main Authors:	Toibazar, Daulet; Wang, Kesen; Mohamed, Sherif; Al-Badawi, Abdulaziz; Alfulayt, Abdulrahman; Moreno, Pedro J.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2507.20156

_version_	1866916866901934080
author	Toibazar, Daulet; Wang, Kesen; Mohamed, Sherif; Al-Badawi, Abdulaziz; Alfulayt, Abdulrahman; Moreno, Pedro J.
author_facet	Toibazar, Daulet; Wang, Kesen; Mohamed, Sherif; Al-Badawi, Abdulaziz; Alfulayt, Abdulrahman; Moreno, Pedro J.
contents	Vision-language models (VLMs) extend conventional large language models by integrating visual data, enabling richer multimodal reasoning and significantly broadening the practical applications of AI. However, including visual inputs also brings new challenges in maintaining data quality. Empirical evidence consistently shows that carefully curated and representative training examples often yield superior results compared to simply increasing the quantity of data. Inspired by this observation, we introduce a streamlined data filtration framework that employs a compact VLM, fine-tuned on a high-quality image-caption annotated dataset. This model evaluates and filters potential training samples based on caption quality, image quality, and image-caption alignment. Unlike previous approaches, which typically add auxiliary filtration modules on top of existing full-scale VLMs, our method relies exclusively on the inherent evaluative capability of a purpose-built small VLM. This strategy eliminates the need for extra modules and reduces training overhead. Our lightweight model efficiently filters out inaccurate, noisy web data, improving image-text alignment and caption linguistic fluency. Experimental results show that datasets that underwent high-precision filtration using our compact VLM perform on par with, or even surpass, larger and noisier datasets gathered through high-volume web crawling. Thus, our method provides a lightweight yet robust solution for building high-quality vision-language training corpora. Availability and implementation: Our compact VLM filtration model, training data, utility scripts, and supplementary data (Appendices) are freely available at https://github.com/daulettoibazar/Compact_VLM_Filter.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_20156
institution	arXiv
publishDate	2025
record_format	arxiv
title	Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2507.20156

Similar Items