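Staff View: Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation :: Library Catalog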

Bibliographic Details
Main Authors:	Kotecha, Madhav; Vaishya, Vijendra Kumar; Gautam, Smita; Racha, Suraj
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; 68T05
Online Access:	https://arxiv.org/abs/2505.01523

_version_	1866915271207288832
author	Kotecha, Madhav; Vaishya, Vijendra Kumar; Gautam, Smita; Racha, Suraj
author_facet	Kotecha, Madhav; Vaishya, Vijendra Kumar; Gautam, Smita; Racha, Suraj
contents	We propose a refined approach to efficiently fine-tuning large language models (LLMs) on specific domains, such as mathematics, using a budgeted subset selection method. Our approach combines utility and diversity metrics to select the most informative and representative training examples. The goal is to achieve performance close to that of full-dataset training using a carefully selected subset of the data, while significantly reducing computational cost and training time. The utility metric incorporates both perplexity and Chain-of-Thought (CoT) loss to identify challenging examples that contribute most to model learning, while the diversity metric ensures broad coverage across mathematical subdomains. We evaluate our method on LLaMA-3 8B and Phi-3 models, comparing against several baseline approaches, including random selection, diversity-based sampling, and existing state-of-the-art subset selection techniques.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_01523
institution	arXiv
publishDate	2025
record_format	arxiv
title	Subset Selection for Fine-Tuning: A Utility-Diversity Balanced Approach for Mathematical Domain Adaptation
topic	Machine Learning; Artificial Intelligence; 68T05
url	https://arxiv.org/abs/2505.01523
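
Illustrative note: the abstract in the contents field describes a budgeted selection that balances a utility score (perplexity and CoT loss) against diversity across mathematical subdomains. The Python sketch below shows one way such a selection could be written. It is not the authors' implementation: the greedy strategy, the way the two utility signals are combined, the diversity term, the weight `lam`, and the dictionary keys `perplexity`, `cot_loss`, and `subdomain` are all assumptions made for illustration.

```python
import math


def select_subset(examples, budget, lam=0.5):
    """Greedily pick up to `budget` example indices.

    Each example is a dict with hypothetical keys:
      'perplexity' (float > 0), 'cot_loss' (float), 'subdomain' (str).
    `lam` in [0, 1] weights diversity against utility (assumed form).
    """
    # Assumed utility: log-perplexity plus CoT loss, so harder examples score higher.
    raw = [math.log(ex["perplexity"]) + ex["cot_loss"] for ex in examples]
    if not raw:
        return []

    # Min-max normalize utility to [0, 1] so it is comparable to the diversity term.
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0
    utility = [(u - lo) / span for u in raw]

    selected = []
    picked_per_subdomain = {}
    remaining = set(range(len(examples)))

    while remaining and len(selected) < budget:

        def gain(i):
            # Assumed diversity: reward shrinks as a subdomain fills up.
            seen = picked_per_subdomain.get(examples[i]["subdomain"], 0)
            diversity = 1.0 / (1 + seen)
            return (1 - lam) * utility[i] + lam * diversity

        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
        name = examples[best]["subdomain"]
        picked_per_subdomain[name] = picked_per_subdomain.get(name, 0) + 1

    return selected
```

With `lam=0.0` the sketch reduces to picking the highest-utility examples; raising `lam` pushes the selection toward subdomains that are not yet represented.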