Bibliographic Details
Main Authors: Sun, Jianwei; Mei, Chaoyang; Wei, Linlin; Zheng, Kaiyu; Liu, Na; Cui, Ming; Li, Tianyi
Format: Preprint
Published: 2024
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2403.09167
author Sun, Jianwei
Mei, Chaoyang
Wei, Linlin
Zheng, Kaiyu
Liu, Na
Cui, Ming
Li, Tianyi
contents The efficacy of large language models (LLMs) is heavily dependent on the quality of the underlying data, particularly within specialized domains. A common challenge when fine-tuning LLMs for domain-specific applications is the potential degradation of the model's generalization capabilities. To address these issues, we propose a two-stage approach for the construction of production prompts designed to yield high-quality data. This method involves the generation of a diverse array of prompts that encompass a broad spectrum of tasks and exhibit a rich variety of expressions. Furthermore, we introduce a cost-effective, multi-dimensional quality assessment framework to ensure the integrity of the generated labeling data. Using a dataset of service provider and customer interactions from the real estate sector, we demonstrate a positive correlation between data quality and model performance. Notably, our findings indicate that the domain-specific proficiency of general LLMs can be enhanced through fine-tuning with data produced via our proposed method, without compromising their overall generalization abilities, even when exclusively domain-specific data is employed for fine-tuning.
format Preprint
id arxiv_https___arxiv_org_abs_2403_09167
institution arXiv
publishDate 2024
record_format arxiv
title Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse
topic Computation and Language
url https://arxiv.org/abs/2403.09167