Affichage MARC: :: Library Catalog

Enregistré dans:

Détails bibliographiques
Auteur principal:	Daras, John N.
Format:	Preprint
Publié:	2025
Sujets:	Machine Learning Databases Quantitative Methods
Accès en ligne:	https://arxiv.org/abs/2509.23942
Tags:	Ajouter un tag Pas de tags, Soyez le premier à ajouter un tag!

_version_	1866915520151814144
author	Daras, John N.
author_facet	Daras, John N.
contents	Advancements in tools like Shapely 2.0 and Triton can significantly improve the efficiency of spatial similarity computations by enabling faster and more scalable geometric operations. However, for extremely large datasets, these optimizations may face challenges due to the sheer volume of computations required. To address this, we propose a framework that reduces the number of clusters requiring verification, thereby decreasing the computational load on these systems. The framework integrates dynamic similarity index thresholding, supervised scheduling, and recall-constrained optimization to efficiently identify clusters with the highest spatial similarity while meeting user-defined precision and recall requirements. By leveraging Kernel Density Estimation (KDE) to dynamically determine similarity thresholds and machine learning models to prioritize clusters, our approach achieves substantial reductions in computational cost without sacrificing accuracy. Experimental results demonstrate the scalability and effectiveness of the method, offering a practical solution for large-scale geospatial analysis.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_23942
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Efficient Identification of High Similarity Clusters in Polygon Datasets Daras, John N. Machine Learning Databases Quantitative Methods Advancements in tools like Shapely 2.0 and Triton can significantly improve the efficiency of spatial similarity computations by enabling faster and more scalable geometric operations. However, for extremely large datasets, these optimizations may face challenges due to the sheer volume of computations required. To address this, we propose a framework that reduces the number of clusters requiring verification, thereby decreasing the computational load on these systems. The framework integrates dynamic similarity index thresholding, supervised scheduling, and recall-constrained optimization to efficiently identify clusters with the highest spatial similarity while meeting user-defined precision and recall requirements. By leveraging Kernel Density Estimation (KDE) to dynamically determine similarity thresholds and machine learning models to prioritize clusters, our approach achieves substantial reductions in computational cost without sacrificing accuracy. Experimental results demonstrate the scalability and effectiveness of the method, offering a practical solution for large-scale geospatial analysis.
title	Efficient Identification of High Similarity Clusters in Polygon Datasets
topic	Machine Learning Databases Quantitative Methods
url	https://arxiv.org/abs/2509.23942

Documents similaires