Bibliographic Details
Main Authors: Malikussaid, Floresko, Septian Caesar, Sutiyo
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2601.18921
author Malikussaid
Floresko, Septian Caesar
Sutiyo
contents The integration of large-scale chemical databases represents a critical bottleneck in modern cheminformatics research, particularly for machine learning applications requiring high-quality, multi-source validated datasets. This paper presents a case study of integrating three major public chemical repositories, PubChem (176 million compounds), ChEMBL, and eMolecules, to construct a curated dataset for molecular property prediction. We investigate whether byte-offset indexing can practically overcome brute-force scalability limits while preserving data integrity at hundred-million scale. Our results document the progression from an intractable brute-force search algorithm with a projected 100-day runtime to a byte-offset indexing architecture that completes in 3.2 hours, a 740-fold performance improvement obtained by reducing algorithmic complexity from $O(N \times M)$ to $O(N + M)$. Systematic validation of 176 million database entries revealed hash collisions in InChIKey molecular identifiers, necessitating pipeline reconstruction using collision-free full InChI strings. We present performance benchmarks, quantify trade-offs between storage overhead and scientific rigor, and compare our approach with alternative large-scale integration strategies. The resulting system successfully extracted 435,413 validated compounds and demonstrates generalizable principles for large-scale scientific data integration where uniqueness constraints exceed hash-based identifier capabilities.
format Preprint
id arxiv_https___arxiv_org_abs_2601_18921
institution arXiv
publishDate 2026
record_format arxiv
title Accelerating Large-Scale Cheminformatics Using a Byte-Offset Indexing Architecture for Terabyte-Scale Data Integration
topic Databases
Computational Engineering, Finance, and Science
Machine Learning
Quantitative Methods
68P20 (Primary) 68P05, 68P10, 92E10 (Secondary)
H.2.8; H.3.1; E.5
url https://arxiv.org/abs/2601.18921
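
The abstract describes replacing a brute-force $O(N \times M)$ search with byte-offset indexing that runs in $O(N + M)$. The record does not include the authors' implementation, so the following is only a minimal sketch of the general technique under stated assumptions: a tab-separated text file whose first column is a full InChI string, and hypothetical helper names (`build_offset_index`, `fetch_records`, `demo`) and record ids (`rec-1`, ...) invented for illustration.

```python
import os
import tempfile


def build_offset_index(path):
    """One pass over the large file: map each key to the byte offset of its line.

    Assumption: the key is the first tab-separated field (here a full InChI).
    """
    index = {}
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()          # byte position where this line starts
            line = fh.readline()
            if not line:
                break
            key = line.split(b"\t", 1)[0].rstrip(b"\r\n")
            index.setdefault(key, offset)  # keep the first occurrence only
    return index


def fetch_records(path, index, queries):
    """One pass over the queries: seek straight to each matching line."""
    hits = {}
    with open(path, "rb") as fh:
        for query in queries:
            offset = index.get(query)   # average O(1) dictionary lookup
            if offset is None:
                continue                # query not present in the indexed file
            fh.seek(offset)
            hits[query] = fh.readline().rstrip(b"\r\n").split(b"\t")
    return hits


def demo():
    """Build a three-row toy file, index it, and look up two queries."""
    rows = [
        b"InChI=1S/CH4/h1H4\trec-1\n",
        b"InChI=1S/H2O/h1H2\trec-2\n",
        b"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3\trec-3\n",
    ]
    with tempfile.NamedTemporaryFile(delete=False, suffix=".tsv") as tmp:
        tmp.writelines(rows)
        path = tmp.name
    try:
        index = build_offset_index(path)
        queries = [b"InChI=1S/H2O/h1H2", b"InChI=1S/H3N/h1H3"]
        return index, fetch_records(path, index, queries)
    finally:
        os.remove(path)
```

In this sketch the indexed file is read once to build the dictionary and each query costs one lookup plus one seek, which is where the $O(N + M)$ behavior comes from. The index holds one key and one integer per record, so at hundred-million scale its memory footprint is the main design constraint. Keying on the full InChI string rather than the InChIKey follows the abstract's remark about hash collisions.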