| Main Author: | Perez-Reche, Francisco J. |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Machine Learning Methodology |
| Online Access: | https://arxiv.org/abs/2603.03235 |
| _version_ | 1866917456098885632 |
|---|---|
| author | Perez-Reche, Francisco J. |
| author_facet | Perez-Reche, Francisco J. |
| contents | Selecting the number of clusters remains a fundamental challenge in unsupervised learning. Existing approaches typically focus on identifying a single "optimal" partition, often overlooking statistically meaningful structure present across multiple resolutions. We introduce ElbowSig, a general inferential framework for assessing clustering structure over a range of resolutions. The method formalizes the elbow heuristic by defining a normalized discrete curvature statistic based on the sequence of within-cluster heterogeneity values, and evaluates its significance relative to a null distribution of unstructured data. This yields hypothesis tests across resolutions, enabling simultaneous inference at multiple clustering scales. We derive the asymptotic behavior of the null statistic in both large-sample and high-dimensional regimes, characterizing its limiting form and variability. Because it depends only on the heterogeneity sequence, ElbowSig is compatible with a wide range of clustering algorithms, including hard, fuzzy, and model-based methods. Experiments on synthetic and real datasets show that the procedure controls Type-I error under unstructured data while providing power to detect multiscale organization, revealing structure that is often missed by single-resolution selection criteria. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_03235 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | The elbow statistic: Multiscale clustering statistical significance |
| topic | Machine Learning Methodology |
| url | https://arxiv.org/abs/2603.03235 |
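
The idea summarized in the contents field (a curvature statistic computed from the within-cluster heterogeneity sequence and compared against unstructured data) can be illustrated with a minimal sketch. This is not the ElbowSig implementation: the record does not give the paper's normalization or null model, so the sketch assumes k-means within-cluster sum of squares as the heterogeneity, a second difference scaled by the one-cluster value as the "normalized discrete curvature", uniform draws from the data's bounding box as the null, and Monte Carlo p-values without any multiple-testing adjustment. All function names are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wcss(points, k, rng, n_init=3, n_iter=30):
    """Lowest within-cluster sum of squares over a few Lloyd's k-means restarts."""
    best = float("inf")
    for _ in range(n_init):
        centers = rng.sample(points, k)
        for _ in range(n_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
                clusters[nearest].append(p)
            # Recompute centroids; an empty cluster keeps its old centre.
            updated = [
                tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
                for j, members in enumerate(clusters)
            ]
            if updated == centers:
                break
            centers = updated
        wcss = sum(min(sq_dist(p, c) for c in centers) for p in points)
        best = min(best, wcss)
    return best


def heterogeneity_sequence(points, k_max, rng):
    """W_1..W_K: heterogeneity at each number of clusters k = 1..k_max."""
    return [kmeans_wcss(points, k, rng) for k in range(1, k_max + 1)]


def curvature(w):
    """Second difference of W_1..W_K, scaled by W_1 (an assumed normalization).

    Returns {k: value} for k = 2..K-1, where w[k - 1] holds W_k.
    """
    return {k: (w[k - 2] - 2 * w[k - 1] + w[k]) / w[0] for k in range(2, len(w))}


def uniform_null(points, rng):
    """Unstructured reference data: uniform draws from the bounding box of `points`."""
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    return [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in points]


def elbow_p_values(points, k_max=4, n_null=19, seed=0):
    """Monte Carlo p-value per k: share of null curvatures at least as large as observed."""
    rng = random.Random(seed)
    observed = curvature(heterogeneity_sequence(points, k_max, rng))
    exceed = {k: 0 for k in observed}
    for _ in range(n_null):
        null = curvature(heterogeneity_sequence(uniform_null(points, rng), k_max, rng))
        for k in observed:
            exceed[k] += null[k] >= observed[k]
    return {k: (1 + exceed[k]) / (n_null + 1) for k in observed}


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Two well-separated Gaussian blobs in the plane (synthetic demo data).
    data = [(data_rng.gauss(c, 0.5), data_rng.gauss(c, 0.5)) for c in (0.0, 10.0) for _ in range(30)]
    print(elbow_p_values(data))
```

Because the statistic only needs the sequence W_1..W_K, `kmeans_wcss` could be swapped for any other heterogeneity measure (for example a fuzzy or model-based objective) without touching `curvature` or `elbow_p_values`, which mirrors the algorithm-agnostic claim in the abstract.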