| Main authors: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2025 |
| Online access: | https://doi.org/10.5281/zenodo.15792613 |
Table of Contents:
- <h1>Learning Topological Features for Protein Stability Prediction</h1> <p><strong>Author:</strong> Amish Mishra<br><strong>Date:</strong> June 25, 2025</p> <h2>Project Overview</h2> <p>This project investigates the use of <strong>topological data analysis (TDA)</strong> to predict <strong>protein stability</strong> from structural data. We work with synthetic protein designs from the <a href="https://www.science.org/doi/10.1126/science.aan0693">Rocklin dataset</a>, applying machine learning models to classify proteins as <strong>stable</strong> or <strong>unstable</strong> using one of three feature sets:</p> <ol> <li> <p><strong>Persistence diagrams</strong> processed via <a href="https://arxiv.org/abs/1702.07959">Cover-tree Differencing via Entropy Reduction (CDER)</a>,</p> </li> <li> <p><strong>Subject matter expert (SME) features</strong>, and</p> </li> <li> <p><strong>A combination of CDER and SME features</strong>.</p> </li> </ol> <h2>Research Questions</h2> <p>We explore the following key questions:</p> <ol> <li> <p>Which regions of the persistence diagram are characteristic of stable or unstable proteins (based on CDER coordinates)?</p> </li> <li> <p>How does incorporating topological information affect model performance on this classification task? Does adding CDER-generated features to an SME model improve the model?</p> </li> <li> <p>How are the CDER and SME features correlated? What does a highly correlated pair of CDER and SME features tell us?</p> </li> <li> <p>Which features are most important for the classifiers?</p> </li> </ol> <h2>Protein Design Notation</h2> <p>We use a simplified English-letter notation to represent protein secondary structures. 
Below is the conversion from design labels to secondary-structure symbols:</p> <table> <thead> <tr> <th>Design Label</th> <th>Secondary Structure</th> </tr> </thead> <tbody> <tr> <td>HHH</td> <td>ααα</td> </tr> <tr> <td>HEEH</td> <td>αββα</td> </tr> <tr> <td>EHEE</td> <td>βαββ</td> </tr> <tr> <td>EEHEE</td> <td>ββαββ</td> </tr> </tbody> </table> <p>Where:</p> <ul> <li> <p><strong>α (alpha)</strong> = alpha helix</p> </li> <li> <p><strong>β (beta)</strong> = beta sheet</p> </li> </ul> <h2>Installation Guide</h2> <p>To get started, follow the steps below:</p> <ol> <li> <p><strong>Download and unzip the repository, then make it your working directory</strong></p> </li> <li> <p><strong>Create a conda environment</strong></p> <pre><code>conda env create -f environment.yml</code></pre> </li> <li> <p><strong>Activate the environment</strong></p> <pre><code>conda activate cder2</code></pre> </li> <li> <p><strong>Install the CDER dependency</strong></p> <ul> <li> <p>Required for running any CDER-based notebook.</p> </li> <li> <p>Follow the instructions at: <a href="https://github.com/geomdata/gda-public">https://github.com/geomdata/gda-public</a></p> </li> </ul> </li> <li> <p><strong>Launch Jupyter Notebook</strong></p> <pre><code>jupyter notebook</code></pre> </li> </ol> <h2>Directory Structure</h2> <table> <thead> <tr> <th>Directory</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>cder_feature_importances_dataframes</code></td> <td>CDER feature importance data</td> </tr> <tr> <td><code>cder_models</code></td> <td>Trained CDER models</td> </tr> <tr> <td><code>classifiers</code></td> <td>Best classifiers from hyperparameter tuning</td> </tr> <tr> <td><code>features_dataframes</code></td> <td>Normalized features used for training</td> </tr> <tr> <td><code>figures</code></td> <td>Plots and visualizations</td> </tr> <tr> <td><code>perf_dataframes</code></td> <td>Performance results of ML models</td> </tr> <tr> <td><code>protein_metadata</code></td> <td>CSV files with SME features 
and stability labels</td> </tr> <tr> <td><code>protein_pdbs/rocklin_2017</code></td> <td>Raw <code>.pdb</code> protein structure files</td> </tr> <tr> <td><code>protein_pds</code></td> <td>Persistence diagrams (PDs)</td> </tr> <tr> <td><code>sme_feature_importances_dataframes</code></td> <td>SME feature importances by design topology</td> </tr> </tbody> </table> <h2>Notebook Workflow</h2> <p>Run the notebooks in the following order:</p> <ol> <li> <p><code>01make_pds.ipynb</code> – Generate persistence diagrams (PDs)</p> </li> <li> <p><code>02make_main_df.ipynb</code> – Combine stability, SME, and PD data into one dataframe</p> </li> <li> <p><code>03analyze_protein_df.ipynb</code> – Visualize the distribution of stable vs. unstable proteins</p> </li> <li> <p><code>04sme_ml.ipynb</code> – Train and test an ML model using SME features</p> </li> <li> <p><code>05cder-ml.ipynb</code> – Train and test an ML model using CDER features from PDs</p> </li> <li> <p><code>06sme-cder-ml.ipynb</code> – Train and test an ML model using both SME and CDER features</p> </li> <li> <p><code>07make_feature_imp_df.ipynb</code> – Compute average SME feature importance by design topology</p> </li> <li> <p><code>08analyze_performance.ipynb</code> – Compare model performance and plot feature importances</p> </li> <li> <p><code>09visualize_cder_correlations.ipynb</code> – Plot CDER-SME feature correlations and highlight informative features</p> </li> <li> <p><code>hex_plot_example.ipynb</code> – Visualize region-of-difference hexplots on PDs</p> </li> </ol>
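<p>The correlation step in the workflow above (notebook 09) pairs each CDER feature with each SME feature. The following is a minimal sketch of that idea, not the notebook's actual code: it assumes Pearson correlation as the measure, and the feature names (<code>cder_0</code>, <code>sme_a</code>, ...) and toy values are hypothetical placeholders, with one value per protein.</p>

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; report 0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def cross_correlations(cder, sme):
    """Return {(cder_name, sme_name): r} for every CDER/SME feature pair."""
    return {
        (c_name, s_name): pearson(c_vals, s_vals)
        for c_name, c_vals in cder.items()
        for s_name, s_vals in sme.items()
    }


# Hypothetical toy data (assumption): five proteins, made-up feature names.
cder_features = {
    "cder_0": [0.1, 0.4, 0.35, 0.8, 0.9],
    "cder_1": [1.0, 0.2, 0.7, 0.3, 0.5],
}
sme_features = {
    "sme_a": [1.0, 2.1, 1.9, 4.2, 4.4],
    "sme_b": [5.0, 3.0, 4.0, 2.0, 1.0],
}

corr = cross_correlations(cder_features, sme_features)

# The pair with the largest absolute correlation is the first candidate
# to inspect when asking what a CDER coordinate might be capturing.
best_pair = max(corr, key=lambda pair: abs(corr[pair]))
print(best_pair, round(corr[best_pair], 3))
```

<p>Ranking pairs by absolute correlation treats strong negative and strong positive relationships as equally informative. With real data, the two dictionaries would presumably be replaced by columns from the tables in <code>features_dataframes</code>.</p>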