Staff View: A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't) :: Library Catalog

Bibliographic Details
Main Authors:	Nayak, Nihal V.; Rodriguez-Diaz, Paula; Hulkund, Neha; Beery, Sara; Alvarez-Melis, David
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.14696

_version_	1866912907256659968
author	Nayak, Nihal V.; Rodriguez-Diaz, Paula; Hulkund, Neha; Beery, Sara; Alvarez-Melis, David
author_facet	Nayak, Nihal V.; Rodriguez-Diaz, Paula; Hulkund, Neha; Beery, Sara; Alvarez-Melis, David
contents	Instruction fine-tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero-shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient-based data representations choose subsets whose similarity to the query consistently predicts performance across datasets and models. While no single method dominates, gradient-based representations paired with a greedy round-robin selection algorithm tend to perform best on average at low budgets, but these benefits diminish at larger budgets. Finally, we unify several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set, and support this view with new generalization bounds. More broadly, our findings provide critical insights and a foundation for more principled data selection in LLM fine-tuning. The code is available at https://github.com/dcml-lab/targeted-instruction-selection.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_14696
institution	arXiv
publishDate	2026
record_format	arxiv
title	A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)
topic	Machine Learning
url	https://arxiv.org/abs/2602.14696

Similar Items