Staff View: A Survey on Data Quality Dimensions and Tools for Machine Learning :: Library Catalog

Bibliographic Details
Main Authors:	Zhou, Yuhan; Tu, Fengjiao; Sha, Kewei; Ding, Junhua; Chen, Haihua
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2406.19614

_version_	1866909232900603904
author	Zhou, Yuhan; Tu, Fengjiao; Sha, Kewei; Ding, Junhua; Chen, Haihua
author_facet	Zhou, Yuhan; Tu, Fengjiao; Sha, Kewei; Ding, Junhua; Chen, Haihua
contents	Machine learning (ML) technologies have become substantial in practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools in the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Based on the discussions on the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and could drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: https://github.com/haihua0913/awesome-dq4ml.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_19614
institution	arXiv
publishDate	2024
record_format	arxiv
title	A Survey on Data Quality Dimensions and Tools for Machine Learning
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2406.19614

Similar Items