| Main Authors: | Wang, Xin; Shi, Kaiwen; Oliver, Carlos |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Biomolecules; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2510.01632 |
| _version_ | 1866914568276541440 |
|---|---|
| author | Wang, Xin; Shi, Kaiwen; Oliver, Carlos |
| author_facet | Wang, Xin; Shi, Kaiwen; Oliver, Carlos |
| contents | Protein function is driven by cohesive substructures, such as catalytic triads, binding pockets, and structural motifs, that occupy only a small fraction of a protein's residues. Yet existing pipelines built on protein encoders do not model proteins at the substructure level, leaving the central biological question unanswered: which substructure of a protein is responsible for its function? We introduce BioBlobs, an encoder-agnostic, end-to-end differentiable framework that compresses a protein into a small set of cohesive substructures (blobs) and predicts function from these blobs alone, so that each blob corresponds to a candidate functional region. Across diverse protein function prediction tasks and multiple sequence- and structure-based encoders, BioBlobs matches or exceeds strong baselines while operating on only a small fraction of residues. The discovered blobs adapt their spatial scale to the task, ranging from local catalytic sites to entire structural domains. Trained only on protein-level labels, BioBlobs recovers experimentally annotated catalytic sites in the M-CSA database, demonstrating unsupervised functional substructure discovery and opening a path to large-scale functional site discovery across the unannotated proteome. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2510_01632 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | BioBlobs: Unsupervised Discovery of Functional Substructures for Protein Function Prediction |
| topic | Biomolecules Artificial Intelligence |
| url | https://arxiv.org/abs/2510.01632 |