Staff View: Retrieve, Merge, Predict: Augmenting Tables with Data Lakes :: Library Catalog

Bibliographic Details
Main Authors:	Cappuzzo, Riccardo; Coelho, Aimee; Lefebvre, Felix; Papotti, Paolo; Varoquaux, Gael
Format:	Preprint
Published:	2024
Subjects:	Databases; Machine Learning
Online Access:	https://arxiv.org/abs/2402.06282

_version_	1866909615377088512
author	Cappuzzo, Riccardo; Coelho, Aimee; Lefebvre, Felix; Papotti, Paolo; Varoquaux, Gael
author_facet	Cappuzzo, Riccardo; Coelho, Aimee; Lefebvre, Felix; Papotti, Paolo; Varoquaux, Gael
contents	Machine-learning from a disparate set of tables, a data lake, requires assembling features by merging and aggregating tables. Data discovery can extend autoML to data tables by automating these steps. We present an in-depth analysis of such automated table augmentation for machine learning tasks, analyzing different methods for the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a tool for benchmarking this data discovery task. Systematic exploration on both lakes outlines 1) the importance of accurately retrieving candidate tables to join, 2) the efficiency of simple merging methods, and 3) the resilience of tree-based learners to noisy conditions. Our experimental environment is easily reproducible and based on open data, to foster more research on feature engineering, autoML, and learning in data lakes.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_06282
institution	arXiv
publishDate	2024
record_format	arxiv
title	Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
topic	Databases; Machine Learning
url	https://arxiv.org/abs/2402.06282
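
Illustrative sketch (not part of the catalog record): the contents field names three steps: retrieving joinable tables, merging information, and predicting with the resulting table. The first two can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the key-containment score, the 0.5 threshold, mean aggregation, and every function, table, and column name are hypothetical choices made for illustration. The prediction step would pass the merged rows to a learner and is not shown.

```python
from statistics import mean


def containment(query_keys, candidate_keys):
    """Fraction of distinct query keys that also appear in the candidate."""
    query = set(query_keys)
    if not query:
        return 0.0
    return len(query & set(candidate_keys)) / len(query)


def retrieve(base_rows, key, lake, threshold=0.5):
    """Return names of lake tables whose key column covers enough base keys.

    Assumption: joinability is approximated by set containment of the key.
    """
    base_keys = [row[key] for row in base_rows]
    scored = []
    for name, rows in lake.items():
        cand_keys = [row[key] for row in rows if key in row]
        score = containment(base_keys, cand_keys)
        if score >= threshold:
            scored.append((score, name))
    # Highest containment first.
    return [name for _, name in sorted(scored, reverse=True)]


def merge(base_rows, key, cand_rows, prefix):
    """Left join on `key`; values from multiple matching rows are averaged."""
    groups = {}
    for row in cand_rows:
        if key in row:
            groups.setdefault(row[key], []).append(row)
    columns = sorted({c for row in cand_rows for c in row if c != key})
    merged = []
    for row in base_rows:
        out = dict(row)
        matches = groups.get(row[key], [])
        for col in columns:
            values = [m[col] for m in matches if m.get(col) is not None]
            # Unmatched base rows keep a None placeholder (left-join semantics).
            out[f"{prefix}.{col}"] = mean(values) if values else None
        merged.append(out)
    return merged


# Hypothetical toy data: a base table and a two-table "lake".
base = [
    {"city": "Paris", "price": 10.0},
    {"city": "Lyon", "price": 7.0},
    {"city": "Nice", "price": 8.0},
]
lake = {
    "population": [
        {"city": "Paris", "pop": 2100},
        {"city": "Lyon", "pop": 500},
        {"city": "Lyon", "pop": 700},
    ],
    "unrelated": [{"city": "Oslo", "pop": 700}],
}

candidates = retrieve(base, "city", lake)
augmented = base
for name in candidates:
    augmented = merge(augmented, "city", lake[name], prefix=name)
```

In this toy setup only the `population` table passes the containment threshold, and the two `Lyon` rows are averaged into a single feature so the base table keeps one row per city.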