Bibliographic Details
Main Authors: Wu, Haodong, Zhang, Jiahao, Hu, Lijie, Zhang, Yongqi
Format: Preprint
Published: 2026
Subjects: Machine Learning; Computation and Language
Online Access: https://arxiv.org/abs/2605.12944
author Wu, Haodong
Zhang, Jiahao
Hu, Lijie
Zhang, Yongqi
contents Supervised fine-tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top-$k$ subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed-pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high-quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two-layer solver that decouples fixed-pool materialization based on cached task-, data-, and model-side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits, Gaussian-process-assisted ranking, and stagnation-triggered reseeding. Experiments on a 90K instruction pool show that AutoSelection achieves the strongest in-distribution reasoning average across three base models, outperforming full-data training, random recipe search, random top-$k$, and single-operator selectors. Additional out-of-distribution graph-reasoning results, search-stability analyses, structural ablations, and 1.5B-to-7B transfer checks further show that recipe structure matters beyond individual selection operators. Code is available at https://github.com/w253/AutoSelection.
format Preprint
id arxiv_https___arxiv_org_abs_2605_12944
institution arXiv
publishDate 2026
record_format arxiv
title From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2605.12944
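
The fixed-pool recipe formulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' AutoSelection implementation: the operator names (`filter_min_length`, `dedup_exact`, `top_k_by_score`), the record fields (`id`, `text`, `score`), and the `run_recipe` helper are all hypothetical and chosen only for illustration. The sketch assumes a recipe is an ordered list of operators that only select from a fixed pool and never generate or rewrite samples, and it shows that reordering the same operators changes the selected subset.

```python
from typing import Callable, Dict, List

# Hypothetical record layout: {"id": int, "text": str, "score": float}
Example = Dict[str, object]
Operator = Callable[[List[Example]], List[Example]]


def filter_min_length(min_words: int) -> Operator:
    """Keep examples whose text has at least `min_words` whitespace-separated words."""
    def op(pool: List[Example]) -> List[Example]:
        return [ex for ex in pool if len(str(ex["text"]).split()) >= min_words]
    return op


def dedup_exact() -> Operator:
    """Drop later examples whose lowercased, stripped text was already seen."""
    def op(pool: List[Example]) -> List[Example]:
        seen, kept = set(), []
        for ex in pool:
            key = str(ex["text"]).strip().lower()
            if key not in seen:
                seen.add(key)
                kept.append(ex)
        return kept
    return op


def top_k_by_score(k: int) -> Operator:
    """Keep the k examples with the highest (assumed precomputed) score."""
    def op(pool: List[Example]) -> List[Example]:
        return sorted(pool, key=lambda ex: ex["score"], reverse=True)[:k]
    return op


def run_recipe(pool: List[Example], recipe: List[Operator]) -> List[Example]:
    """Apply the operators in order; every step returns a subset of its input."""
    subset = list(pool)
    for op in recipe:
        subset = op(subset)
    return subset


# Toy pool (made-up data, for illustration only).
POOL = [
    {"id": 0, "text": "Solve 2 + 2", "score": 0.9},
    {"id": 1, "text": "solve 2 + 2", "score": 0.8},
    {"id": 2, "text": "Prove that the sum of two even numbers is even", "score": 0.7},
    {"id": 3, "text": "Hi", "score": 0.95},
]

# Same operators, different order.
recipe_a = [filter_min_length(3), dedup_exact(), top_k_by_score(2)]
recipe_b = [top_k_by_score(2), filter_min_length(3), dedup_exact()]

ids_a = [ex["id"] for ex in run_recipe(POOL, recipe_a)]
ids_b = [ex["id"] for ex in run_recipe(POOL, recipe_b)]

print(ids_a)  # [0, 2]
print(ids_b)  # [0]
```

In this toy case, applying top-$k$ first spends one of the two slots on a short example that the length filter later discards, so the two orderings yield different subsets from the same pool. A recipe search in the sense of the abstract would explore such orderings and parameters under an evaluation budget; that search layer is not sketched here.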