Saved in:
| Main Authors: | , , , , , , |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.06524 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866915484205580288 |
|---|---|
| author | Wu, Jian Yu, Hang Liu, Bingchang Yang, Wenjie Di, Peng Li, Jianguo Zhang, Yue |
| author_facet | Wu, Jian Yu, Hang Liu, Bingchang Yang, Wenjie Di, Peng Li, Jianguo Zhang, Yue |
| contents | Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to simultaneously achieve these goals. In this paper, we introduce LAMDAS (LLM As an iMplicit classifier for domain-specific DAta Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization process. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that "belongs" to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS not only exceeds the performance of full-data training using a fraction of the data but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency compared to all evaluated baselines. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2509_06524 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection Wu, Jian Yu, Hang Liu, Bingchang Yang, Wenjie Di, Peng Li, Jianguo Zhang, Yue Computation and Language Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to simultaneously achieve these goals. In this paper, we introduce LAMDAS (LLM As an iMplicit classifier for domain-specific DAta Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization process. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that "belongs" to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS not only exceeds the performance of full-data training using a fraction of the data but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency compared to all evaluated baselines. |
| title | LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2509.06524 |