Saved in:
| Main Authors: | , , |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.01206 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866915489596309504 |
|---|---|
| author | Łukasiewicz, Aleksander Tětek, Jakub Veselý, Pavel |
| author_facet | Łukasiewicz, Aleksander Tětek, Jakub Veselý, Pavel |
| contents | Space-efficient streaming estimation of quantiles in massive datasets is a fundamental problem with numerous applications in data monitoring and analysis. While theoretical research led to optimal algorithms, such as the Greenwald-Khanna algorithm or the KLL sketch, practitioners often use other sketches that perform significantly better in practice but lack theoretical guarantees. Most notably, the widely used $t$-digest has unbounded worst-case error.
In this paper, we seek to get the best of both worlds. We present a new quantile summary, SplineSketch, for numeric data, offering near-optimal theoretical guarantees, namely uniformly bounded rank error, and outperforming $t$-digest by a factor of 2-20 on a range of synthetic and real-world datasets. To achieve such performance, we develop a novel approach that maintains a dynamic subdivision of the input range into buckets while fitting the input distribution using monotone cubic spline interpolation. The core challenge is implementing this method in a space-efficient manner while ensuring strong worst-case guarantees. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2504_01206 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | SplineSketch: Even More Accurate Quantiles with Error Guarantees Łukasiewicz, Aleksander Tětek, Jakub Veselý, Pavel Data Structures and Algorithms Databases Computation Space-efficient streaming estimation of quantiles in massive datasets is a fundamental problem with numerous applications in data monitoring and analysis. While theoretical research led to optimal algorithms, such as the Greenwald-Khanna algorithm or the KLL sketch, practitioners often use other sketches that perform significantly better in practice but lack theoretical guarantees. Most notably, the widely used $t$-digest has unbounded worst-case error. In this paper, we seek to get the best of both worlds. We present a new quantile summary, SplineSketch, for numeric data, offering near-optimal theoretical guarantees, namely uniformly bounded rank error, and outperforming $t$-digest by a factor of 2-20 on a range of synthetic and real-world datasets. To achieve such performance, we develop a novel approach that maintains a dynamic subdivision of the input range into buckets while fitting the input distribution using monotone cubic spline interpolation. The core challenge is implementing this method in a space-efficient manner while ensuring strong worst-case guarantees. |
| title | SplineSketch: Even More Accurate Quantiles with Error Guarantees |
| topic | Data Structures and Algorithms Databases Computation |
| url | https://arxiv.org/abs/2504.01206 |