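Staff View: Accelerating Large-Scale Cheminformatics Using a Byte-Offset Indexing Architecture for Terabyte-Scale Data Integration :: Library Catalog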

Bibliographic Details
Main Authors:	Malikussaid, Floresko, Septian Caesar, Sutiyo
Format:	Preprint
Published:	2026
Subjects:	Databases; Computational Engineering, Finance, and Science; Machine Learning; Quantitative Methods; MSC 68P20 (Primary); 68P05, 68P10, 92E10 (Secondary); ACM H.2.8; H.3.1; E.5
Online Access:	https://arxiv.org/abs/2601.18921

_version_	1866917354510745600
author	Malikussaid Floresko, Septian Caesar Sutiyo
author_facet	Malikussaid Floresko, Septian Caesar Sutiyo
contents	The integration of large-scale chemical databases represents a critical bottleneck in modern cheminformatics research, particularly for machine learning applications requiring high-quality, multi-source validated datasets. This paper presents a case study of integrating three major public chemical repositories: PubChem (176 million compounds), ChEMBL, and eMolecules, to construct a curated dataset for molecular property prediction. We investigate whether byte-offset indexing can practically overcome brute-force scalability limits while preserving data integrity at hundred-million scale. Our results document the progression from an intractable brute-force search algorithm with a projected 100-day runtime to a byte-offset indexing architecture that completes in 3.2 hours, a 740-fold performance improvement achieved by reducing algorithmic complexity from $O(N \times M)$ to $O(N + M)$. Systematic validation of 176 million database entries revealed hash collisions in InChIKey molecular identifiers, necessitating pipeline reconstruction using collision-free full InChI strings. We present performance benchmarks, quantify trade-offs between storage overhead and scientific rigor, and compare our approach with alternative large-scale integration strategies. The resulting system successfully extracted 435,413 validated compounds and demonstrates generalizable principles for large-scale scientific data integration where uniqueness constraints exceed hash-based identifier capabilities.
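The abstract's move from $O(N \times M)$ to $O(N + M)$ can be illustrated with a minimal sketch of byte-offset indexing. This is not the authors' pipeline. It assumes a tab-separated text file whose first column is a full InChI string, and the function names (`build_offset_index`, `fetch_records`, `demo`) and the toy rows are hypothetical. The idea is to scan the large file once, record where each line starts, and then seek straight to the wanted records instead of rescanning the file for every query.

```python
import os
import tempfile


def build_offset_index(path, key_column=0, sep="\t"):
    """Scan the file once and map each key to the byte offset of its line."""
    index = {}
    sep_bytes = sep.encode("utf-8")
    with open(path, "rb") as handle:  # binary mode so offsets are true byte positions
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            key = line.rstrip(b"\r\n").split(sep_bytes)[key_column].decode("utf-8")
            index.setdefault(key, offset)  # keep the first occurrence of a key
    return index


def fetch_records(path, index, keys):
    """Seek directly to each requested record; keys missing from the index are skipped."""
    found = {}
    with open(path, "rb") as handle:
        for key in keys:
            offset = index.get(key)
            if offset is None:
                continue
            handle.seek(offset)
            found[key] = handle.readline().decode("utf-8").rstrip("\r\n")
    return found


def demo():
    # Toy data (assumption): full InChI string as the key, then a name column.
    rows = [
        "InChI=1S/CH4/h1H4\tmethane",
        "InChI=1S/H2O/h1H2\twater",
        "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3\tethanol",
    ]
    with tempfile.NamedTemporaryFile(
        "w", suffix=".tsv", delete=False, newline="\n", encoding="utf-8"
    ) as tmp:
        tmp.write("\n".join(rows) + "\n")
        path = tmp.name
    try:
        index = build_offset_index(path)
        wanted = ["InChI=1S/H2O/h1H2", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"]
        return index, fetch_records(path, index, wanted)
    finally:
        os.remove(path)
```

Building the index costs one pass over the M lines of the large file, and each of the N lookups is a dictionary probe plus a single seek, which is where the additive cost comes from. The sketch keys on the full InChI string rather than the hashed InChIKey, in line with the collision finding reported in the abstract. The price is a larger in-memory index.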
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_18921
institution	arXiv
publishDate	2026
record_format	arxiv
title	Accelerating Large-Scale Cheminformatics Using a Byte-Offset Indexing Architecture for Terabyte-Scale Data Integration
topic	Databases; Computational Engineering, Finance, and Science; Machine Learning; Quantitative Methods; MSC 68P20 (Primary); 68P05, 68P10, 92E10 (Secondary); ACM H.2.8; H.3.1; E.5
url	https://arxiv.org/abs/2601.18921

Similar Items