Staff View: Chameleon2++: An Efficient and Scalable Variant Of Chameleon Clustering :: Library Catalog

Bibliographic Details
Main Authors:	Singh, Priyanshu; Ahuja, Kapil
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Data Structures and Algorithms; I.2; F.2
Online Access:	https://arxiv.org/abs/2501.02612

_version_	1866909824604700672
author	Singh, Priyanshu; Ahuja, Kapil
author_facet	Singh, Priyanshu; Ahuja, Kapil
contents	Hierarchical clustering remains a fundamental challenge in data mining, particularly for large-scale datasets where traditional approaches fail to scale. Recent Chameleon-based algorithms (Chameleon2, M-Chameleon, and INNGS-Chameleon) have proposed advanced strategies, but they still suffer from $O(n^2)$ computational complexity, which is costly for large datasets. With Chameleon2 as the base algorithm, we introduce Chameleon2++, which addresses this challenge. Our algorithm has three parts. First, Graph Generation: we replace the exact $k$-NN search with an approximate one, specifically by integrating the Annoy algorithm. This yields fast approximate nearest-neighbor computation and significantly reduces graph generation time. Second, Graph Partitioning: we use a multi-level partitioning algorithm instead of a recursive bisection one. Specifically, we adapt the hMETIS algorithm in place of the FM algorithm, because multi-level algorithms are robust to the approximation introduced in the graph generation phase and yield higher-quality partitions with minimal configuration. Third, Merging: we retain the flood fill heuristic, which ensures connected, balanced components in the partitions, as well as the efficient partition-merging criteria that lead to the final clusters. These enhancements reduce the overall time complexity to $O(n\log n)$, achieving scalability. On real-world benchmark datasets used in prior Chameleon works, Chameleon2++ delivers an average 4% improvement in clustering quality. This demonstrates that algorithmic efficiency and clustering quality can coexist in large-scale hierarchical clustering.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_02612
institution	arXiv
publishDate	2025
record_format	arxiv
title	Chameleon2++: An Efficient and Scalable Variant Of Chameleon Clustering
topic	Machine Learning; Data Structures and Algorithms; I.2; F.2
url	https://arxiv.org/abs/2501.02612
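
Illustrative note on the abstract: the record's contents field mentions a k-NN graph generation step and a flood fill heuristic that keeps components connected. The following is a minimal sketch of those two ideas only, not the authors' implementation. It assumes an exact brute-force neighbor search as a stand-in for the approximate Annoy search, omits the hMETIS partitioning and merging steps entirely, and uses hypothetical names (knn_graph, flood_fill_components) and made-up toy points.

```python
import math
from collections import deque


def knn_graph(points, k):
    """Build a symmetric k-NN graph by exact brute force.

    Assumption: this O(n^2) search is only a stand-in for the approximate
    nearest-neighbor search (Annoy) named in the abstract.
    """
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            graph[i].add(j)
            graph[j].add(i)  # symmetrize so the graph is undirected
    return graph


def flood_fill_components(graph):
    """Label connected components with a breadth-first flood fill."""
    labels = {}
    current = 0
    for start in graph:
        if start in labels:
            continue  # already reached from an earlier seed
        labels[start] = current
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in labels:
                    labels[neighbor] = current
                    queue.append(neighbor)
        current += 1
    return labels


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_graph(points, k=2)
labels = flood_fill_components(graph)
```

With k=2 each toy point's nearest neighbors lie inside its own group, so the flood fill assigns one label per group. Swapping the brute-force search for an approximate index changes only knn_graph; the flood fill operates on whatever graph it is given.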