Bibliographic details
Main author: Kumar, Ajay
Format: Digital resource
Language:
Published: Zenodo 2026
Subjects:
Online access: https://doi.org/10.5281/zenodo.20204514
Table of contents:
<p>Processed data and a trained neural network from a faithful reproduction of <a href="https://arxiv.org/abs/1907.03041">Eastman & Pande, "Predicting Gene Expression Between Species with Neural Networks" (arXiv:1907.03041, 2019)</a>.</p><p>The reproduction pipeline is open source at <a href="https://github.com/Ajay1989Kumar/eastman-pande-2019-repro">Ajay1989Kumar/eastman-pande-2019-repro</a>. This Zenodo record contains the processed artifacts the code produces, so downstream users can run the evaluation, train alternative models, or verify the reproduction without re-running the full pipeline.</p><h3>Contents</h3><ul><li><b>rma_rat.tsv</b>, <b>rma_human.tsv</b>: RMA + BrainArray ENTREZG v22 normalized expression matrices for Open TG-GATEs rat and human in-vitro liver microarrays. Log₂ scale, Entrez gene IDs as rows, sample barcodes as columns. Shapes: 14,132×3,226 (rat); 20,414×2,397 (human).</li><li><b>pairs_rat_X.tsv</b>, <b>pairs_human_Y.tsv</b>: replicate-averaged matrices paired on (compound, dose_level, time). 1,122 paired conditions across 139 compounds. 
ML-ready: rows aligned across both files via <b>pairs_metadata.tsv</b> and split via <b>pairs_splits.tsv</b> into 1,024 train pairs (125 compounds) and 98 test pairs (14 compounds, the exact set used in Eastman & Pande 2019).</li><li><b>results_predictions.tsv</b>, <b>results_sigma.tsv</b>: neural network predictions for the 98 test samples (mean and combined aleatoric + epistemic uncertainty from 50 MC-dropout passes).</li><li><b>results_overall.tsv</b>, <b>results_per_sample.tsv</b>, <b>results_top100.tsv</b>, <b>results_fig5.tsv</b>: evaluation outputs, namely overall Pearson r and MAE, per-sample correlations, top-100 differentially expressed gene overlap per test compound (paper's Table 1), and MAE by compound and dose at 24 h (paper's Figure 5).</li><li><b>training_log.tsv</b>: per-epoch loss and learning rate, 1,000 epochs.</li><li><b>train_means.npz</b>: per-gene training-set means used for centering predictions (required to compare against the paper's centered Pearson r).</li><li><b>model_checkpoint.pt</b> (if present): PyTorch state dict for the trained 1.1B-parameter network. 
Single hidden layer of width 20,000, ReLU, dropout 0.5, dual heads for mean and log-variance, trained 1,000 epochs with Adam.</li></ul><h3>Reproduction summary</h3><p>Metrics are evaluated in the centered space the paper uses (per-gene training-set mean subtracted from both predictions and ground truth before correlating):</p><table><tbody><tr><th>Metric</th><th>Eastman & Pande 2019</th><th>This reproduction</th></tr><tr><td>Overall Pearson r</td><td>0.697</td><td>0.696</td></tr><tr><td>Overall MAE</td><td>0.158</td><td>0.162</td></tr><tr><td>Per-sample r (median)</td><td>0.791</td><td>0.779</td></tr><tr><td>Top-100 DE overlap (mean per compound)</td><td>41.1</td><td>40.4</td></tr></tbody></table><p>An end-to-end correctness audit (9 checks; see <code>audit_pipeline.py</code> in the GitHub repo) passes with no data leakage detected. The checks include train/test compound disjointness, train-only centering means, NaN-free predictions and weights, and monotone learning-rate decay over 1,000 epochs.</p><h3>Source data</h3><p>Raw CEL files (not redistributed here) are from the <a href="https://toxico.nibiohn.go.jp/english/">Open TG-GATEs</a> toxicogenomics database at NIBIOHN. RMA normalization used the <a href="https://brainarray.mbni.med.umich.edu/">BrainArray</a> ENTREZG v22 custom CDF packages from the University of Michigan.</p>
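<p>The centered evaluation described in the reproduction summary can be sketched as follows. This is a minimal, standard-library illustration, not the repository's evaluation code: the function names <code>pearson</code> and <code>centered_metrics</code> are hypothetical, matrices are assumed to be laid out as one row per test sample and one column per gene, and the overall Pearson r is assumed to be computed over all (sample, gene) values pooled together.</p>

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def centered_metrics(pred, truth, train_means):
    """Hypothetical helper: metrics in the centered space.

    pred, truth: lists of rows, one row per test sample, one value per gene.
    train_means: per-gene means computed on the training set only.
    """
    # Subtract the training-set mean of each gene from predictions and ground truth.
    pred_c = [[v - m for v, m in zip(row, train_means)] for row in pred]
    truth_c = [[v - m for v, m in zip(row, train_means)] for row in truth]

    # Assumption: "overall" metrics pool every (sample, gene) value.
    flat_p = [v for row in pred_c for v in row]
    flat_t = [v for row in truth_c for v in row]
    overall_r = pearson(flat_p, flat_t)
    overall_mae = sum(abs(p - t) for p, t in zip(flat_p, flat_t)) / len(flat_p)

    # One correlation per test sample, computed across genes.
    per_sample_r = [pearson(p, t) for p, t in zip(pred_c, truth_c)]
    return overall_r, overall_mae, per_sample_r


if __name__ == "__main__":
    # Toy values, not data from this record.
    toy_pred = [[1.0, 2.0, 4.0], [2.0, 1.0, 5.0]]
    toy_truth = [[1.5, 2.5, 3.5], [2.0, 1.5, 5.5]]
    toy_means = [1.5, 1.5, 4.0]
    print(centered_metrics(toy_pred, toy_truth, toy_means))
```

<p>Centering changes the correlations but not the MAE, because the same per-gene mean is subtracted from both the prediction and the ground truth and cancels in their difference.</p>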
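<p>The record describes <b>results_sigma.tsv</b> as a combined aleatoric + epistemic uncertainty from 50 MC-dropout passes, without giving the formula. One common way to combine the two, sketched below as an assumption and not as the repository's confirmed method, is to add the mean of the predicted variances (aleatoric) to the variance of the predicted means across passes (epistemic). The function name <code>combine_mc_dropout</code> and the use of the population variance are illustrative choices.</p>

```python
from math import exp, sqrt


def combine_mc_dropout(means, log_vars):
    """Hypothetical helper: combine T stochastic forward passes for one sample.

    means, log_vars: lists of T passes, each a list with one value per gene
    (the outputs of the mean head and the log-variance head).
    Returns (predictive mean, total standard deviation), each one value per gene.
    """
    n_passes = len(means)
    n_genes = len(means[0])

    # Predictive mean: average of the mean-head outputs over passes.
    pred_mean = [sum(p[g] for p in means) / n_passes for g in range(n_genes)]

    # Aleatoric part: average predicted variance, exp(log-variance).
    aleatoric = [sum(exp(lv[g]) for lv in log_vars) / n_passes for g in range(n_genes)]

    # Epistemic part: spread of the mean-head outputs across passes
    # (population variance, an assumption of this sketch).
    epistemic = [
        sum((p[g] - pred_mean[g]) ** 2 for p in means) / n_passes
        for g in range(n_genes)
    ]

    sigma = [sqrt(a + e) for a, e in zip(aleatoric, epistemic)]
    return pred_mean, sigma


if __name__ == "__main__":
    # Two toy passes over two genes; the record itself uses 50 passes.
    toy_means = [[1.0, 2.0], [3.0, 2.0]]
    toy_log_vars = [[0.0, 0.0], [0.0, 0.0]]
    print(combine_mc_dropout(toy_means, toy_log_vars))
```

<p>In a PyTorch setting the passes would be produced by keeping dropout active at inference time; how the repository does this is not stated in the record.</p>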
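<p>The stated 1.1B parameter count can be checked against the architecture with simple arithmetic. The sketch below assumes that the network takes all 14,132 rat genes as input, predicts all 20,414 human genes, uses bias terms in every layer, and attaches both heads directly to the hidden layer; none of these details are confirmed by the record beyond the layer width and the dual heads.</p>

```python
def dense_params(n_in, n_out, bias=True):
    """Number of trainable parameters in one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


# Dimensions taken from the matrix shapes and architecture listed in the record.
N_RAT_GENES = 14_132    # assumed input width
N_HUMAN_GENES = 20_414  # assumed output width of each head
HIDDEN = 20_000         # stated hidden-layer width

hidden_layer = dense_params(N_RAT_GENES, HIDDEN)
mean_head = dense_params(HIDDEN, N_HUMAN_GENES)
log_var_head = dense_params(HIDDEN, N_HUMAN_GENES)

total = hidden_layer + mean_head + log_var_head
print(f"{total:,} parameters (~{total / 1e9:.2f} B)")
# prints: 1,099,260,828 parameters (~1.10 B)
```

<p>Under these assumptions the two output heads account for roughly three quarters of the parameters, which is consistent with the 1.1B figure quoted for <b>model_checkpoint.pt</b>.</p>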