Staff View: Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask

Bibliographic Details
Main Authors:	Abraham, Ashley N.; Strelzoff, Andrew; Dozier, Haley R.; Henslee, Althea C.; Chappell, Mark A.
Format:	Preprint
Published:	2026
Subjects:	Machine Learning Performance
Online Access:	https://arxiv.org/abs/2604.21645

_version_	1866918464531202048
author	Abraham, Ashley N.; Strelzoff, Andrew; Dozier, Haley R.; Henslee, Althea C.; Chappell, Mark A.
author_facet	Abraham, Ashley N.; Strelzoff, Andrew; Dozier, Haley R.; Henslee, Althea C.; Chappell, Mark A.
contents	Large-scale Nearest Neighbor (NN) search, though widely used in similarity search, remains constrained by the computational cost of processing large-scale data. To reduce that cost, Approximate Nearest Neighbor (ANN) search is often used in applications that can rely on an approximation rather than an exact similarity search. Product Quantization (PQ) is a memory-efficient ANN method that is effective for clustering datasets of all sizes. Clustering large-scale, high-dimensional data is computationally expensive in both memory and execution time. This work focuses on a divide-and-conquer approach to large-scale data in Python using PQ, inverted indexing, and Dask, combining the partial results without compromising accuracy while reducing computational requirements to the level needed for medium-scale data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_21645
institution	arXiv
publishDate	2026
record_format	arxiv
title	Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask
topic	Machine Learning Performance
url	https://arxiv.org/abs/2604.21645
