| Main Authors: | , |
|---|---|
| Format: | Digital resource |
| Language: | English |
| Published: | Zenodo, 2026 |
| Online Access: | https://doi.org/10.5281/zenodo.20019351 |
Table of Contents:
- <h2>PWAV–ChemBERTa: Reproducible Framework for Molecular Property Prediction</h2> <p>This repository provides a fully reproducible implementation of the <em>Path-Weighted Atom Vectors (PWAV)</em> descriptor framework and its integration with ChemBERTa for molecular property prediction. The package includes all data, descriptor files, model implementations, and scripts required to reproduce the experimental results reported in the associated manuscript.</p> <h3>Overview</h3> <p>Molecular property prediction plays a critical role in cheminformatics and environmental chemistry, supporting tasks such as risk assessment and molecular design. This work introduces PWAV, an interpretable descriptor family that encodes atom-level and path-based structural information. The framework evaluates PWAV both as a standalone representation and in combination with transformer-based SMILES embeddings via a gated fusion architecture.</p> <p>The repository enables systematic evaluation across six physicochemical endpoints:</p> <ul> <li> <p>Octanol–water partition coefficient (log P)</p> </li> <li> <p>Aqueous solubility (log S)</p> </li> <li> <p>Bioconcentration factor (log BCF)</p> </li> <li> <p>Boiling point (BP)</p> </li> <li> <p>Melting point (MP)</p> </li> <li> <p>Vapor pressure (log VP)</p> </li> </ul> <h3>Contents</h3> <p>The package includes:</p> <ul> <li> <p><strong>Raw dataset</strong> (Zang et al.) 
and cleaned descriptor files in Parquet format</p> </li> <li> <p><strong>Descriptor generation pipeline</strong> (MACCS, Morgan, PWAV)</p> </li> <li> <p><strong>Machine learning models</strong>:</p> <ul> <li> <p>XGBoost baseline models</p> </li> <li> <p>Feed-forward neural networks (ANN)</p> </li> <li> <p>ChemBERTa-based fusion models</p> </li> </ul> </li> <li> <p><strong>SHAP-based feature analysis</strong> for interpretability and PWAV-64 construction</p> </li> <li> <p><strong>Nested cross-validation framework</strong> for robust performance estimation</p> </li> <li> <p><strong>Ablation and stress-test configurations</strong> for fusion analysis</p> </li> <li> <p>Fully structured <strong>command-line scripts</strong> for all experiments</p> </li> </ul> <h3>Key Features</h3> <ul> <li> <p>Reproducible end-to-end pipeline with consistent data handling</p> </li> <li> <p>Training-only PCA for PWAV dimensionality reduction</p> </li> <li> <p>SHAP-based feature selection (PWAV-64)</p> </li> <li> <p>Modular fusion framework with gating, FiLM conditioning, and modality dropout</p> </li> <li> <p>Support for both benchmark and scaffold splits</p> </li> <li> <p>Structured output hierarchy suitable for publication and analysis</p> </li> </ul> <h3>Reproducibility</h3> <p>All experiments can be reproduced using the provided scripts without external dependencies beyond the included dataset. 
Default configurations correspond to those used in the manuscript, including fixed random seeds and benchmark splits for comparability.</p> <h3>Intended Use</h3> <p>This repository is intended for:</p> <ul> <li> <p>Reproducing results from the associated study</p> </li> <li> <p>Benchmarking descriptor-based and hybrid molecular representations</p> </li> <li> <p>Extending PWAV descriptors or integrating with alternative models (e.g., graph neural networks)</p> </li> <li> <p>Exploring interpretability and feature attribution in molecular prediction tasks</p> </li> </ul> <h3>Notes</h3> <ul> <li> <p>ChemBERTa-based experiments may benefit from GPU acceleration for efficient training.</p> </li> </ul> <h3>License</h3> <p>This software is released under the MIT License.</p> <h3>Citation</h3> <p>If you use this repository, please cite the associated manuscript and this Zenodo record.</p>
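The description lists a nested cross-validation framework, training-only PCA, and fixed random seeds. The sketch below illustrates how these ideas fit together at the index level: inner folds are drawn only from each outer training fold, so the outer test fold never influences model selection or fitted preprocessing. It is a minimal illustration using only the Python standard library, not the repository's implementation. The function names (`k_fold_indices`, `nested_cv_splits`), the fold counts, and the seeding scheme are assumptions made for this example.

```python
import random


def k_fold_indices(indices, k, seed):
    """Shuffle `indices` with a fixed seed and split them into k near-equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    # Striding assigns every k-th shuffled element to the same fold.
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation.

    `inner_splits` is a list of (inner_train, inner_val) pairs drawn only from
    `outer_train`, so the outer test fold stays unseen during tuning.
    Hypothetical helper for illustration; defaults are arbitrary.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [
            idx for j, fold in enumerate(outer_folds) if j != i for idx in fold
        ]
        # A different but deterministic seed per outer fold keeps runs repeatable.
        inner_folds = k_fold_indices(outer_train, inner_k, seed + 1 + i)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [
                idx for n, fold in enumerate(inner_folds) if n != m for idx in fold
            ]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


if __name__ == "__main__":
    for outer_train, outer_test, inner_splits in nested_cv_splits(20, outer_k=4):
        # Fitted preprocessing (for example PCA) would be fitted on the
        # training indices only, then applied to the held-out indices.
        print(len(outer_train), len(outer_test), [len(v) for _, v in inner_splits])
```

In a workflow of this shape, hyperparameters would be chosen on the inner splits, the selected model refitted on `outer_train`, and the reported score computed on `outer_test` alone.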