| Main Authors: | Shu, Youwei; Zheng, Shaomian; Jin, Dingnan; Qu, Wenjie; Guo, Ziyao; Cui, Qing; Zhou, Jun; Zhang, Jiaheng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2602.13773 |
| _version_ | 1866908834672410624 |
|---|---|
| author | Shu, Youwei; Zheng, Shaomian; Jin, Dingnan; Qu, Wenjie; Guo, Ziyao; Cui, Qing; Zhou, Jun; Zhang, Jiaheng |
| author_facet | Shu, Youwei; Zheng, Shaomian; Jin, Dingnan; Qu, Wenjie; Guo, Ziyao; Cui, Qing; Zhou, Jun; Zhang, Jiaheng |
| contents | Data quality is a crucial factor in large language model training. While prior work has shown that models trained on smaller, high-quality datasets can outperform those trained on much larger but noisy or low-quality corpora, systematic methods for industrial-scale data selection in instruction tuning remain underexplored. In this work, we study instruction-tuning data selection through the lens of semantic representation similarity and identify a key limitation of state-of-the-art LLM encoders: they produce highly redundant semantic embeddings. To mitigate this redundancy, we propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. CRDS-R applies Rademacher random projection followed by concatenation of transformer hidden-layer representations, while CRDS-W employs whitening-based dimensionality reduction to improve representational quality. Experimental results demonstrate that both variants substantially enhance data quality and consistently outperform state-of-the-art representation-based selection methods. Notably, CRDS-W achieves strong performance using only 3.5% of the data, surpassing the full-data baseline by an average of 0.71% across four datasets. Our code is available at https://github.com/tdano1/CRDS. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2602_13773 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | On Representation Redundancy in Large-Scale Instruction Tuning Data Selection |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2602.13773 |
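
The abstract names two compression steps: Rademacher random projection with concatenation across hidden layers (CRDS-R) and whitening-based dimensionality reduction (CRDS-W). The NumPy sketch below illustrates the general form of each operation. It is not the authors' implementation (their code is in the linked repository). The function names `rademacher_project` and `whiten_reduce`, the `1/sqrt(out_dim)` scaling, and the use of PCA-style whitening are assumptions made for illustration.

```python
import numpy as np


def rademacher_project(layers, out_dim, seed=0):
    """Project each layer's embeddings with a random +/-1 matrix, then concatenate.

    layers: list of arrays, each of shape (n_samples, hidden_dim).
    Returns an array of shape (n_samples, len(layers) * out_dim).
    """
    rng = np.random.default_rng(seed)
    projected = []
    for h in layers:
        # Rademacher matrix: each entry is +1 or -1 with equal probability.
        r = rng.choice([-1.0, 1.0], size=(h.shape[1], out_dim))
        # Scaling by 1/sqrt(out_dim) is a common convention (assumed here).
        projected.append(h @ r / np.sqrt(out_dim))
    return np.concatenate(projected, axis=1)


def whiten_reduce(x, out_dim, eps=1e-8):
    """PCA-whiten embeddings and keep the top out_dim components (assumed variant)."""
    centered = x - x.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (x.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:out_dim]  # indices of the largest eigenvalues
    # Divide each kept direction by its standard deviation to get unit variance.
    w = eigvecs[:, top] / np.sqrt(eigvals[top] + eps)
    return centered @ w
```

After whitening, the retained dimensions are decorrelated with unit variance, so a similarity score computed on them is no longer dominated by a few high-variance directions. That is one plausible reading of how whitening could reduce the redundancy the abstract describes.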