Bibliographic Details
Main Authors: Sharma, Ruben; King, Ross D.
Format: Preprint
Published: 2025
Subjects: Machine Learning; Artificial Intelligence; Information Theory
Online Access: https://arxiv.org/abs/2511.05728
author Sharma, Ruben
King, Ross D.
contents We introduce the first formal large-scale assessment of the utility of traditional chemical functional groups as used in chemical explanations. Our assessment employs a fundamental principle from computational learning theory: a good explanation of data should also compress the data. We introduce an unsupervised learning algorithm based on the Minimum Message Length (MML) principle that searches for substructures that compress a set of around three million biologically relevant molecules. We demonstrate that the discovered substructures contain most human-curated functional groups as well as novel larger patterns with more specific functions. We also run our algorithm on 24 specific bioactivity prediction datasets to discover dataset-specific functional groups. Fingerprints constructed from dataset-specific functional groups are shown to significantly outperform other fingerprint representations, including MACCS and Morgan fingerprints, when training ridge regression models on bioactivity regression tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2511_05728
institution arXiv
publishDate 2025
record_format arxiv
title Compressing Chemistry Reveals Functional Groups
topic Machine Learning
Artificial Intelligence
Information Theory
url https://arxiv.org/abs/2511.05728