Bibliographic Details
Main Authors: Fachada, Nuno, de Andrade, Diogo
Format: Preprint
Published: 2023
Subjects:
Online Access: https://arxiv.org/abs/2301.10327
author Fachada, Nuno
de Andrade, Diogo
author_facet Fachada, Nuno
de Andrade, Diogo
contents Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for more complete coverage of a given problem's space. In turn, synthetic data generators can create vast amounts of data -- a crucial capability when real-world data is at a premium -- while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, comprehensively unit tested and documented, and is available for the Python, R, Julia, and MATLAB/Octave ecosystems. We demonstrate that our proposal can produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2301_10327
institution arXiv
publishDate 2023
record_format arxiv
title Generating Multidimensional Clusters With Support Lines
topic Machine Learning
Computer Vision and Pattern Recognition
Programming Languages
I.5; I.2.5
url https://arxiv.org/abs/2301.10327