Staff View: On Representation Redundancy in Large-Scale Instruction Tuning Data Selection :: Library Catalog

Bibliographic Details
Main Authors:	Shu, Youwei; Zheng, Shaomian; Jin, Dingnan; Qu, Wenjie; Guo, Ziyao; Cui, Qing; Zhou, Jun; Zhang, Jiaheng
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.13773

_version_	1866908834672410624
author	Shu, Youwei; Zheng, Shaomian; Jin, Dingnan; Qu, Wenjie; Guo, Ziyao; Cui, Qing; Zhou, Jun; Zhang, Jiaheng
author_facet	Shu, Youwei; Zheng, Shaomian; Jin, Dingnan; Qu, Wenjie; Guo, Ziyao; Cui, Qing; Zhou, Jun; Zhang, Jiaheng
contents	Data quality is a crucial factor in large language models training. While prior work has shown that models trained on smaller, high-quality datasets can outperform those trained on much larger but noisy or low-quality corpora, systematic methods for industrial-scale data selection in instruction tuning remain underexplored. In this work, we study instruction-tuning data selection through the lens of semantic representation similarity and identify a key limitation of state-of-the-art LLM encoders: they produce highly redundant semantic embeddings. To mitigate this redundancy, we propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. CRDS-R applies Rademacher random projection followed by concatenation of transformer hidden-layer representations, while CRDS-W employs whitening-based dimensionality reduction to improve representational quality. Experimental results demonstrate that both variants substantially enhance data quality and consistently outperform state-of-the-art representation-based selection methods. Notably, CRDS-W achieves strong performance using only 3.5% of the data, surpassing the full-data baseline by an average of 0.71% across four datasets. Our code is available at https://github.com/tdano1/CRDS.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_13773
institution	arXiv
publishDate	2026
record_format	arxiv
title	On Representation Redundancy in Large-Scale Instruction Tuning Data Selection
topic	Machine Learning
url	https://arxiv.org/abs/2602.13773