Bibliographic details
Main authors: Joaquin, Ayrton San; Wang, Bin; Liu, Zhengyuan; Asher, Nicholas; Lim, Brian; Muller, Philippe; Chen, Nancy F.
Format: Preprint
Published: 2024
Subjects: Machine Learning
Online access: https://arxiv.org/abs/2408.03560
author Joaquin, Ayrton San
Wang, Bin
Liu, Zhengyuan
Asher, Nicholas
Lim, Brian
Muller, Philippe
Chen, Nancy F.
contents Despite recent advances, fine-tuning Large Language Models (LLMs) remains costly because of their large parameter counts and the substantial amount of data needed for the model to generalize. Access to computing resources remains a barrier for the open-source community. To address this challenge, we propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples using a trained model. Specifically, we use the model's internal gradients to estimate this relationship and rank the contribution of each training point. To improve efficiency, we propose an optimization that computes influence functions over a reduced number of layers while achieving similar accuracy. Applying our algorithm to instruction fine-tuning data for LLMs, we achieve similar performance with just 50% of the training data. In addition, using influence functions to analyze how well the model covers particular test samples can provide a reliable and interpretable signal of how well the training set covers those test points.
format Preprint
id arxiv_https___arxiv_org_abs_2408_03560
institution arXiv
publishDate 2024
record_format arxiv
title In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models
topic Machine Learning
url https://arxiv.org/abs/2408.03560
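
The abstract describes ranking training points by their gradient-based influence on evaluation samples and keeping a fraction of them as a coreset. The following is a minimal sketch of that general idea on a small logistic regression model, not the In2Core implementation: the model, the synthetic data, the exact damped Hessian solve, and all function names (`fit_logreg`, `influence_scores`, `select_coreset`) are assumptions made for illustration. The rule of keeping the points whose upweighting is estimated to lower the evaluation loss the most is also an assumption, and the layer-reduction optimization mentioned in the abstract is not shown.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def per_sample_grads(w, X, y):
    # Gradient of the logistic loss for each sample: (sigmoid(x.w) - y) * x
    return (sigmoid(X @ w) - y)[:, None] * X


def fit_logreg(X, y, l2=1e-2, lr=0.5, steps=500):
    # Plain gradient descent on the L2-regularized mean logistic loss.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * (per_sample_grads(w, X, y).mean(axis=0) + l2 * w)
    return w


def influence_scores(w, X_train, y_train, X_eval, y_eval, l2=1e-2):
    # Hessian of the regularized mean training loss (the l2 term acts as damping).
    p = sigmoid(X_train @ w)
    H = (X_train * (p * (1.0 - p))[:, None]).T @ X_train / len(X_train)
    H += l2 * np.eye(len(w))
    g_eval = per_sample_grads(w, X_eval, y_eval).mean(axis=0)
    g_train = per_sample_grads(w, X_train, y_train)
    # Classic influence estimate: -g_eval^T H^{-1} g_train_i for each training point i.
    return -g_train @ np.linalg.solve(H, g_eval)


def select_coreset(scores, fraction=0.5):
    # Assumed rule: keep the points with the most negative influence,
    # i.e. those estimated to reduce the evaluation loss the most.
    k = int(round(fraction * len(scores)))
    return np.argsort(scores)[:k]


# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)
X_train, y_train = X[:160], y[:160]
X_eval, y_eval = X[160:], y[160:]

w = fit_logreg(X_train, y_train)
scores = influence_scores(w, X_train, y_train, X_eval, y_eval)
coreset = select_coreset(scores, fraction=0.5)
print(f"kept {len(coreset)} of {len(scores)} training points")
```

For a model the size of an LLM the Hessian cannot be formed or solved exactly as it is here, which is why approximations such as restricting the computation to a subset of layers matter in practice.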