Staff View: Scalable Sampling for High Utility Patterns :: Library Catalog

Bibliographic Details
Main Authors:	Diop, Lamine; Plantevit, Marc
Format:	Preprint
Published:	2024
Subjects:	Databases; Machine Learning; 60: Probability theory; G.3; E.1; E.2; F.2
Online Access:	https://arxiv.org/abs/2410.22964

_version_	1866917823473778688
author	Diop, Lamine; Plantevit, Marc
author_facet	Diop, Lamine; Plantevit, Marc
contents	Discovering valuable insights from data through meaningful associations is a crucial task. However, it becomes challenging when trying to identify representative patterns in quantitative databases, especially with large datasets, as enumeration-based strategies struggle due to the vast search space involved. To tackle this challenge, output space sampling methods have emerged as a promising solution thanks to their ability to discover valuable patterns with reduced computational overhead. However, existing sampling methods often encounter limitations when dealing with large quantitative databases, resulting in scalability-related challenges. In this work, we propose a novel high utility pattern sampling algorithm and its on-disk version, both designed for large quantitative databases and based on two original theorems. Our approach ensures both the interactivity required for user-centered methods and strong statistical guarantees through random sampling. Thanks to our method, users can instantly discover relevant and representative utility patterns, facilitating efficient exploration of the database within seconds. To demonstrate the interest of our approach, we present a compelling use case involving archaeological knowledge graph sub-profiles discovery. Experiments on semantic and non-semantic quantitative databases show that our approach outperforms the state-of-the-art methods.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_22964
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Scalable Sampling for High Utility Patterns Diop, Lamine Plantevit, Marc Databases Machine Learning 60: Probability theory G.3; E.1; E.2; F.2 Discovering valuable insights from data through meaningful associations is a crucial task. However, it becomes challenging when trying to identify representative patterns in quantitative databases, especially with large datasets, as enumeration-based strategies struggle due to the vast search space involved. To tackle this challenge, output space sampling methods have emerged as a promising solution thanks to their ability to discover valuable patterns with reduced computational overhead. However, existing sampling methods often encounter limitations when dealing with large quantitative databases, resulting in scalability-related challenges. In this work, we propose a novel high utility pattern sampling algorithm and its on-disk version, both designed for large quantitative databases and based on two original theorems. Our approach ensures both the interactivity required for user-centered methods and strong statistical guarantees through random sampling. Thanks to our method, users can instantly discover relevant and representative utility patterns, facilitating efficient exploration of the database within seconds. To demonstrate the interest of our approach, we present a compelling use case involving archaeological knowledge graph sub-profiles discovery. Experiments on semantic and non-semantic quantitative databases show that our approach outperforms the state-of-the-art methods.
title	Scalable Sampling for High Utility Patterns
topic	Databases; Machine Learning; 60: Probability theory; G.3; E.1; E.2; F.2
url	https://arxiv.org/abs/2410.22964
