Enregistré dans:
Détails bibliographiques
Auteur principal: Daras, John N.
Format: Preprint
Publié: 2025
Sujets:
Accès en ligne:https://arxiv.org/abs/2509.23942
Tags: Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
_version_ 1866915520151814144
author Daras, John N.
author_facet Daras, John N.
contents Advancements in tools like Shapely 2.0 and Triton can significantly improve the efficiency of spatial similarity computations by enabling faster and more scalable geometric operations. However, for extremely large datasets, these optimizations may face challenges due to the sheer volume of computations required. To address this, we propose a framework that reduces the number of clusters requiring verification, thereby decreasing the computational load on these systems. The framework integrates dynamic similarity index thresholding, supervised scheduling, and recall-constrained optimization to efficiently identify clusters with the highest spatial similarity while meeting user-defined precision and recall requirements. By leveraging Kernel Density Estimation (KDE) to dynamically determine similarity thresholds and machine learning models to prioritize clusters, our approach achieves substantial reductions in computational cost without sacrificing accuracy. Experimental results demonstrate the scalability and effectiveness of the method, offering a practical solution for large-scale geospatial analysis.
format Preprint
id arxiv_https___arxiv_org_abs_2509_23942
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Efficient Identification of High Similarity Clusters in Polygon Datasets
Daras, John N.
Machine Learning
Databases
Quantitative Methods
Advancements in tools like Shapely 2.0 and Triton can significantly improve the efficiency of spatial similarity computations by enabling faster and more scalable geometric operations. However, for extremely large datasets, these optimizations may face challenges due to the sheer volume of computations required. To address this, we propose a framework that reduces the number of clusters requiring verification, thereby decreasing the computational load on these systems. The framework integrates dynamic similarity index thresholding, supervised scheduling, and recall-constrained optimization to efficiently identify clusters with the highest spatial similarity while meeting user-defined precision and recall requirements. By leveraging Kernel Density Estimation (KDE) to dynamically determine similarity thresholds and machine learning models to prioritize clusters, our approach achieves substantial reductions in computational cost without sacrificing accuracy. Experimental results demonstrate the scalability and effectiveness of the method, offering a practical solution for large-scale geospatial analysis.
title Efficient Identification of High Similarity Clusters in Polygon Datasets
topic Machine Learning
Databases
Quantitative Methods
url https://arxiv.org/abs/2509.23942