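Staff View: Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse :: Library Catalog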

Bibliographic Details
Main Authors:	Sun, Jianwei; Mei, Chaoyang; Wei, Linlin; Zheng, Kaiyu; Liu, Na; Cui, Ming; Li, Tianyi
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2403.09167

_version_	1866911797800337408
author	Sun, Jianwei; Mei, Chaoyang; Wei, Linlin; Zheng, Kaiyu; Liu, Na; Cui, Ming; Li, Tianyi
author_facet	Sun, Jianwei; Mei, Chaoyang; Wei, Linlin; Zheng, Kaiyu; Liu, Na; Cui, Ming; Li, Tianyi
contents	The efficacy of large language models (LLMs) is heavily dependent on the quality of the underlying data, particularly within specialized domains. A common challenge when fine-tuning LLMs for domain-specific applications is the potential degradation of the model's generalization capabilities. To address these issues, we propose a two-stage approach for the construction of production prompts designed to yield high-quality data. This method involves the generation of a diverse array of prompts that encompass a broad spectrum of tasks and exhibit a rich variety of expressions. Furthermore, we introduce a cost-effective, multi-dimensional quality assessment framework to ensure the integrity of the generated labeling data. Utilizing a dataset comprised of service provider and customer interactions from the real estate sector, we demonstrate a positive correlation between data quality and model performance. Notably, our findings indicate that the domain-specific proficiency of general LLMs can be enhanced through fine-tuning with data produced via our proposed method, without compromising their overall generalization abilities, even when exclusively domain-specific data is employed for fine-tuning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_09167
institution	arXiv
publishDate	2024
record_format	arxiv
title	Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse
topic	Computation and Language
url	https://arxiv.org/abs/2403.09167
