Staff View: From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning :: Library Catalog

Bibliographic Details
Main Authors:	Wu, Haodong; Zhang, Jiahao; Hu, Lijie; Zhang, Yongqi
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2605.12944

_version_	1866918499418374144
author	Wu, Haodong; Zhang, Jiahao; Hu, Lijie; Zhang, Yongqi
author_facet	Wu, Haodong; Zhang, Jiahao; Hu, Lijie; Zhang, Yongqi
contents	Supervised fine-tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top-$k$ subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed-pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high-quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two-layer solver that decouples fixed-pool materialization based on cached task-, data-, and model-side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits, Gaussian-process-assisted ranking, and stagnation-triggered reseeding. Experiments on a 90K instruction pool show that AutoSelection achieves the strongest in-distribution reasoning average across three base models, outperforming full-data training, random recipe search, random top-$k$, and single-operator selectors. Additional out-of-distribution graph-reasoning results, search-stability analyses, structural ablations, and 1.5B-to-7B transfer checks further show that recipe structure matters beyond individual selection operators. Code is available at https://github.com/w253/AutoSelection.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_12944
institution	arXiv
publishDate	2026
record_format	arxiv
title	From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2605.12944

Similar Items