Salvato in:
Dettagli Bibliografici
Autori principali: Xu, Haochuan, Pham, Ninh
Natura: Preprint
Pubblicazione: 2024
Soggetti:
Accesso online:https://arxiv.org/abs/2402.15679
Tags: Aggiungi Tag
Nessun Tag, puoi essere il primo ad aggiungerne!!
_version_ 1866910948482088960
author Xu, Haochuan
Pham, Ninh
author_facet Xu, Haochuan
Pham, Ninh
contents We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions with high probability. To further facilitate sDBSCAN, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to L2, L1, $χ^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while the scikit-learn's counterparts demand several hours or cannot run due to memory constraints.
format Preprint
id arxiv_https___arxiv_org_abs_2402_15679
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Scalable Density-based Clustering with Random Projections
Xu, Haochuan
Pham, Ninh
Machine Learning
Computer Vision and Pattern Recognition
We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions with high probability. To further facilitate sDBSCAN, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to L2, L1, $χ^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while the scikit-learn's counterparts demand several hours or cannot run due to memory constraints.
title Scalable Density-based Clustering with Random Projections
topic Machine Learning
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2402.15679