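Staff View: ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning :: Library Catalog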

Bibliographic Details
Main Authors:	Wu, Changti; Mao, Jiahuai; Miao, Yuzhuo; Lian, Shijie; Yu, Bin; Lin, Xiaopeng; Huang, Cong; Zhang, Lei; Chen, Kai
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.11636

_version_	1866910020033052672
author	Wu, Changti; Mao, Jiahuai; Miao, Yuzhuo; Lian, Shijie; Yu, Bin; Lin, Xiaopeng; Huang, Cong; Zhang, Lei; Chen, Kai
author_facet	Wu, Changti; Mao, Jiahuai; Miao, Yuzhuo; Lian, Shijie; Yu, Bin; Lin, Xiaopeng; Huang, Cong; Zhang, Lei; Chen, Kai
contents	Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on large-scale datasets is computationally expensive and inefficient because of redundancy in the data, which motivates multimodal data selection to improve training efficiency. Existing data selection methods for VIT require either costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting the visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at https://github.com/ChangtiWu/ScalSelect.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_11636
institution	arXiv
publishDate	2026
record_format	arxiv
title	ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2602.11636
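
Note on the abstract: the second stage it describes (scoring samples by how well their representations approximate the dominant subspace, without pairwise comparisons) can be illustrated with a small sketch. This is not the authors' implementation, which is at the repository linked in the record. The function names, the power-iteration solver, and the scoring rule (squared projection onto the top-k principal directions) are all assumptions chosen for illustration, and the attention-based feature extraction of the first stage is not shown; the rows below are placeholder vectors.

```python
import math
import random


def gram_matrix(rows):
    """Accumulate R^T R (d x d) in a single pass over the N sample rows."""
    d = len(rows[0])
    gram = [[0.0] * d for _ in range(d)]
    for row in rows:
        for i in range(d):
            for j in range(d):
                gram[i][j] += row[i] * row[j]
    return gram


def top_directions(gram, k, iters=200, seed=0):
    """Approximate the top-k eigenvectors of a symmetric PSD matrix
    using power iteration with deflation (an illustrative choice)."""
    rng = random.Random(seed)
    d = len(gram)
    work = [r[:] for r in gram]  # copy so the caller's matrix is untouched
    basis = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for unit-norm v.
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(d)) for i in range(d))
        # Deflate: remove the found direction before searching for the next.
        for i in range(d):
            for j in range(d):
                work[i][j] -= lam * v[i] * v[j]
        basis.append(v)
    return basis


def subspace_scores(rows, k):
    """Score each sample by the squared norm of its projection onto the
    top-k directions (hypothetical scoring rule, assumed for this sketch)."""
    basis = top_directions(gram_matrix(rows), k)
    return [
        sum(sum(a * b for a, b in zip(row, v)) ** 2 for v in basis)
        for row in rows
    ]


def select(rows, k, budget):
    """Return the indices of the `budget` highest-scoring samples."""
    scores = subspace_scores(rows, k)
    order = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return order[:budget]


# Placeholder representations: two samples dominate one direction.
demo_rows = [[3.0, 0.0, 0.0], [0.0, 0.1, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.1]]
picked = select(demo_rows, k=1, budget=2)
```

For a fixed feature dimension d and subspace rank k, both the Gram accumulation and the scoring pass touch each sample once, so the cost grows linearly with the number of samples and no sample-to-sample similarity is computed. That matches the complexity the abstract claims, though the paper's actual scoring function may differ.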

Similar Items