Bibliographic Details
Main Authors: Wang, Xin; Shi, Kaiwen; Oliver, Carlos
Format: Preprint
Published: 2025
Subjects: Biomolecules; Artificial Intelligence
Online Access: https://arxiv.org/abs/2510.01632
author Wang, Xin
Shi, Kaiwen
Oliver, Carlos
contents Protein function is driven by cohesive substructures, such as catalytic triads, binding pockets, and structural motifs, that occupy only a small fraction of a protein's residues. Yet existing pipelines built on protein encoders do not model proteins at the substructure level, leaving the central biological question unanswered: which substructure of a protein is responsible for its function? We introduce BioBlobs, an encoder-agnostic, end-to-end differentiable framework that compresses a protein into a small set of cohesive substructures (blobs) and predicts function from these blobs alone, so that each blob corresponds to a candidate functional region. Across diverse protein function prediction tasks and multiple sequence- and structure-based encoders, BioBlobs matches or exceeds strong baselines while operating on only a small fraction of residues. The discovered blobs adapt their spatial scale to the task, ranging from local catalytic sites to entire structural domains. Trained only on protein-level labels, BioBlobs recovers experimentally annotated catalytic sites in the M-CSA database, demonstrating unsupervised functional substructure discovery and opening a path to large-scale functional site discovery across the unannotated proteome.
format Preprint
id arxiv_https___arxiv_org_abs_2510_01632
institution arXiv
publishDate 2025
record_format arxiv
title BioBlobs: Unsupervised Discovery of Functional Substructures for Protein Function Prediction
topic Biomolecules
Artificial Intelligence
url https://arxiv.org/abs/2510.01632