Bibliographic Details
Main Authors: Choudhury, Shubham, Raghava, Gajendra
Format: Digital resource
Language:
Published: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.19878400
Table of Contents:
  • <p class="ds-markdown-paragraph"><strong>Title:</strong><br>PDAC-LLM Dataset – Blood-derived exosomal transcriptomic profiles of pancreatic ductal adenocarcinoma (PDAC) patients converted to peptide sequences for large language model classification</p> <p class="ds-markdown-paragraph"><strong>Description:</strong></p> <p class="ds-markdown-paragraph"><strong>Project:</strong> PDAC-LLM – A large language model for predicting pancreatic ductal adenocarcinoma patients from blood-derived exosomal transcriptomics data</p> <p class="ds-markdown-paragraph"><strong>Publication:</strong> Choudhury, S., Mehta, N.K., & Raghava, G.P.S. (2025). A large language model for predicting pancreatic ductal adenocarcinoma patients from blood-derived exosomal transcriptomics data. <em>bioRxiv</em>. <a href="https://doi.org/10.1101/2025.03.06.641795" rel="noopener noreferrer">https://doi.org/10.1101/2025.03.06.641795</a></p> <p class="ds-markdown-paragraph"><strong>Overview:</strong><br>This repository accompanies the PDAC-LLM publication and presents a <strong>novel paradigm</strong> – transforming numerical gene expression data into peptide sequences for classification using large language models (LLMs). Unlike traditional machine learning approaches that convert text to numerical features, this study reverses the process: numerical transcriptomic profiles are converted into pseudo‑peptide sequences, enabling LLMs (trained on protein sequences) to mine cancer patient data. This method was applied to predict pancreatic ductal adenocarcinoma (PDAC), the fourth leading cause of cancer‑related deaths globally (5‑year survival <8%), using blood‑derived exosomal transcriptomics – a non‑invasive diagnostic approach.</p> <p class="ds-markdown-paragraph"><strong>Content:</strong><br>The repository contains transcriptomic expression data (TPM values) from 501 patients (284 PDAC, 217 non‑PDAC) derived from blood‑derived exosomes (GEO dataset). 
After feature selection (50 discriminative genes), expression probabilities were converted into peptide sequences (lengths 5–50 amino acids), which were used to fine‑tune and evaluate multiple LLMs (PeptideBERT, ProtBERT, ESM2 variants).</p> <p class="ds-markdown-paragraph"><strong>Dataset summary:</strong></p> <ul> <li> <p class="ds-markdown-paragraph"><strong>Total samples:</strong> 501 (284 PDAC, 217 non‑PDAC)</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Training set:</strong> 401 samples (80%)</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Validation set:</strong> 100 samples (20%)</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Original feature space:</strong> 54,148 genes</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Selected features (final):</strong> 50 genes (identified via 8 feature selection methods; LGBM as primary)</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Peptide lengths evaluated:</strong> 5, 8, 10, 15, 20, 25, 30, 40, 50 residues</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Novelty:</strong> First study demonstrating LLM application for mining transcriptomic profiles of cancer patients</p> </li> </ul> <p class="ds-markdown-paragraph"><strong>Key Gene Biomarkers (Top 5 from LGBM feature selection):</strong></p> <div class="ds-scroll-area ds-scroll-area--show-on-focus-within _1210dd7 c03cafe9 _5ac647c"> <div class="ds-scroll-area__gutters"> <div class="ds-scroll-area__horizontal-gutter"> </div> <div class="ds-scroll-area__vertical-gutter"> </div> </div> <table> <tbody><tr> <th>Gene</th> <th>Role in PDAC</th> </tr> </tbody><tbody> <tr> <td><strong>CLDN1</strong></td> <td>Promotes tumor growth, chemoresistance, invasiveness (cell junctions, metabolic pathways)</td> </tr> <tr> <td><strong>IL7R</strong></td> <td>Regulates immune cell infiltration in tumor microenvironment; influences prognosis</td> </tr> <tr> <td><strong>ITIH2</strong></td> <td>Metastasis suppressor – loss linked to increased 
PDAC aggressiveness</td> </tr> <tr> <td><strong>KRT19</strong></td> <td>Ductal marker; diagnostic biomarker; facilitates immune evasion; associated with poor prognosis</td> </tr> <tr> <td><strong>MBNL1</strong></td> <td>RNA‑binding protein; restricts metastatic progression (alternative splicing regulation)</td> </tr> </tbody> </table> </div> <p class="ds-markdown-paragraph"><strong>Data Processing Workflow:</strong></p> <ol> <li> <p class="ds-markdown-paragraph"><strong>TPM expression matrix</strong> (501 samples × 54,148 genes) – downloaded from GEO</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Logistic regression per gene</strong> – trained on training set (401 samples) to generate probability scores</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Probability matrix</strong> (501 samples × 54,148 genes) – each value represents likelihood of PDAC based on single gene</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Feature selection</strong> – 8 methods (LASSO, Elastic Net, Random Forest, LGBM, Information Gain, ReliefF, RFE+SVM, RFE+RF); 9 subset sizes (5–50 genes)</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Textual representation</strong> – Probability values mapped to amino acids (0–0.05 → A, 0.05–0.10 → C, etc.)</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Pseudo‑peptide generation</strong> – 50‑residue sequences per patient (concatenated amino acids from 50 selected genes)</p> </li> <li> <p class="ds-markdown-paragraph"><strong>LLM fine‑tuning</strong> – PeptideBERT, ProtBERT, t6_8M_UR50D, t12_35M_UR50D, t33_650M_UR50D</p> </li> </ol> <p class="ds-markdown-paragraph"> </p> <p class="ds-markdown-paragraph"><strong>Model Performance (Validation Set – 100 samples, peptide length = 50):</strong></p> <div class="ds-scroll-area ds-scroll-area--show-on-focus-within _1210dd7 c03cafe9 _5ac647c"> <div class="ds-scroll-area__gutters"> <div class="ds-scroll-area__horizontal-gutter"> </div> <div 
class="ds-scroll-area__vertical-gutter"> </div> </div> <table> <tbody><tr> <th>Model</th> <th>Sensitivity</th> <th>Specificity</th> <th>Precision</th> <th>Accuracy</th> <th>MCC</th> <th>F1‑Score</th> <th><strong>AUC</strong></th> </tr> </tbody><tbody> <tr> <td><strong>ProtBERT</strong></td> <td>0.894</td> <td>0.863</td> <td>0.894</td> <td>0.881</td> <td>0.758</td> <td>0.894</td> <td><strong>0.962</strong></td> </tr> <tr> <td>PeptideBERT</td> <td>0.929</td> <td>0.613</td> <td>0.757</td> <td>0.792</td> <td>0.584</td> <td>0.834</td> <td>0.942</td> </tr> <tr> <td>t12_35M_UR50D (ESM2)</td> <td>0.894</td> <td>0.818</td> <td>0.864</td> <td>0.861</td> <td>0.717</td> <td>0.879</td> <td>0.941</td> </tr> <tr> <td>t6_8M_UR50D (ESM2)</td> <td>0.877</td> <td>0.818</td> <td>0.862</td> <td>0.851</td> <td>0.697</td> <td>0.869</td> <td>0.936</td> </tr> <tr> <td>t33_650M_UR50D (ESM2)</td> <td>0.842</td> <td>0.818</td> <td>0.857</td> <td>0.832</td> <td>0.658</td> <td>0.849</td> <td>0.914</td> </tr> </tbody> </table> </div> <p class="ds-markdown-paragraph"><strong>Best Overall Model:</strong> <strong>ProtBERT (fine‑tuned on PDAC pseudo‑peptides)</strong></p> <ul> <li> <p class="ds-markdown-paragraph"><strong>AUC:</strong> 0.962</p> </li> <li> <p class="ds-markdown-paragraph"><strong>MCC:</strong> 0.758</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Accuracy:</strong> 88.1%</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Sensitivity:</strong> 89.4%</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Specificity:</strong> 86.3%</p> </li> </ul> <p class="ds-markdown-paragraph"><strong>Performance across peptide lengths (AUC – Validation set):</strong></p> <div class="ds-scroll-area ds-scroll-area--show-on-focus-within _1210dd7 c03cafe9 _5ac647c"> <div class="ds-scroll-area__gutters"> <div class="ds-scroll-area__horizontal-gutter"> <div class="ds-scroll-area__horizontal-bar"> </div> </div> <div class="ds-scroll-area__vertical-gutter"> </div> </div> <table> <tbody><tr> 
<th>Model</th> <th>k=5</th> <th>k=8</th> <th>k=10</th> <th>k=15</th> <th>k=20</th> <th>k=25</th> <th>k=30</th> <th>k=40</th> <th>k=50</th> </tr> </tbody><tbody> <tr> <td>PeptideBERT</td> <td>0.903</td> <td>0.918</td> <td>0.928</td> <td><strong>0.955</strong></td> <td><strong>0.961</strong></td> <td>0.951</td> <td>0.950</td> <td>0.951</td> <td>0.942</td> </tr> <tr> <td><strong>ProtBERT</strong></td> <td>0.825</td> <td>0.824</td> <td>0.918</td> <td>0.957</td> <td>0.957</td> <td>0.905</td> <td>0.957</td> <td><strong>0.961</strong></td> <td><strong>0.962</strong></td> </tr> <tr> <td>t12_35M_UR50D</td> <td>0.870</td> <td>0.865</td> <td><strong>0.932</strong></td> <td>0.913</td> <td>0.935</td> <td>0.934</td> <td>0.944</td> <td>0.939</td> <td>0.941</td> </tr> </tbody> </table> </div> <p class="ds-markdown-paragraph"><strong>Observation:</strong> PeptideBERT performs better on shorter peptides (k=15–20), while ProtBERT excels on longer peptides (k=40–50).</p> <p class="ds-markdown-paragraph"><strong>Alignment‑based Methods (MERCI – peptide length 50, validation set):</strong></p> <div class="ds-scroll-area ds-scroll-area--show-on-focus-within _1210dd7 c03cafe9 _5ac647c"> <div class="ds-scroll-area__gutters"> <div class="ds-scroll-area__horizontal-gutter"> </div> <div class="ds-scroll-area__vertical-gutter"> </div> </div> <table> <tbody><tr> <th>Metric</th> <th>Value</th> </tr> </tbody><tbody> <tr> <td>Total hits</td> <td>102</td> </tr> <tr> <td>Correct hits</td> <td>82</td> </tr> <tr> <td>Incorrect hits</td> <td>20</td> </tr> <tr> <td>Positive hits</td> <td>77</td> </tr> <tr> <td>Negative hits</td> <td>25</td> </tr> <tr> <td>Correct positive hits</td> <td>62</td> </tr> <tr> <td>Correct negative hits</td> <td>20</td> </tr> </tbody> </table> </div> <p class="ds-markdown-paragraph"><strong>Ensemble Models (LLM + MERCI – best improvements):</strong></p> <div class="ds-scroll-area ds-scroll-area--show-on-focus-within _1210dd7 c03cafe9 _5ac647c"> <div 
class="ds-scroll-area__gutters"> <div class="ds-scroll-area__horizontal-gutter"> </div> <div class="ds-scroll-area__vertical-gutter"> </div> </div> <table> <tbody><tr> <th>Model</th> <th>AUC (LLM alone)</th> <th>AUC (LLM + MERCI)</th> <th>Improvement</th> </tr> </tbody><tbody> <tr> <td>ProtBERT (k=40)</td> <td>0.961</td> <td><strong>0.966</strong></td> <td>+0.005</td> </tr> <tr> <td>ProtBERT (k=8)</td> <td>0.824</td> <td><strong>0.874</strong></td> <td>+0.050</td> </tr> <tr> <td>PeptideBERT (k=8)</td> <td>0.918</td> <td><strong>0.925</strong></td> <td>+0.007</td> </tr> <tr> <td>t33_650M_UR50D (k=20)</td> <td>0.945</td> <td><strong>0.947</strong></td> <td>+0.002</td> </tr> </tbody> </table> </div> <p class="ds-markdown-paragraph"><strong>MERCI consistently improved LLM performance, particularly for smaller peptide lengths.</strong></p> <p class="ds-markdown-paragraph"><strong>Benchmarking Comparison with existing methods:</strong></p> <div class="ds-scroll-area ds-scroll-area--show-on-focus-within _1210dd7 c03cafe9 _5ac647c"> <div class="ds-scroll-area__gutters"> <div class="ds-scroll-area__horizontal-gutter"> </div> <div class="ds-scroll-area__vertical-gutter"> </div> </div> <table> <tbody><tr> <th>Method</th> <th>Biomarker panel size</th> <th>AUC</th> <th>Notes</th> </tr> </tbody><tbody> <tr> <td><strong>ProtBERT (this work)</strong></td> <td>50 genes</td> <td><strong>0.962</strong></td> <td>LLM fine‑tuned on pseudo‑peptides</td> </tr> <tr> <td><strong>t33_650M_UR50D + MERCI (this work)</strong></td> <td>10 genes</td> <td><strong>0.939</strong></td> <td>Ensemble (LLM + MERCI)</td> </tr> <tr> <td><strong>PeptideBERT (this work)</strong></td> <td>5 genes</td> <td><strong>0.903</strong></td> <td>Single LLM, minimal gene panel</td> </tr> <tr> <td>Wang et al. (2020)</td> <td>8 genes</td> <td>0.936</td> <td>SVM on TPM expression</td> </tr> <tr> <td>Wang et al. 
(2024)</td> <td>4 lncRNAs</td> <td>0.848</td> <td>lncRNA signature</td> </tr> </tbody> </table> </div> <p class="ds-markdown-paragraph"><strong>With the 50‑gene panel, fine‑tuned ProtBERT (AUC 0.962) exceeds the AUCs reported for the existing methods in the table above (0.936 for the SVM of Wang et al. 2020; 0.848 for the lncRNA signature of Wang et al. 2024), which supports converting numerical transcriptomic data into sequence representations for LLM classification.</strong></p> <p class="ds-markdown-paragraph"><strong>Feature Selection Methods Evaluated (8 methods, 9 subset sizes):</strong></p> <div class="ds-scroll-area ds-scroll-area--show-on-focus-within _1210dd7 c03cafe9 _5ac647c"> <div class="ds-scroll-area__gutters"> <div class="ds-scroll-area__horizontal-gutter"> </div> <div class="ds-scroll-area__vertical-gutter"> </div> </div> <table> <tbody><tr> <th>Category</th> <th>Method</th> <th>Implementation</th> </tr> </tbody><tbody> <tr> <td>Embedded regularization</td> <td>LASSO (L1)</td> <td>LassoCV (sklearn)</td> </tr> <tr> <td>Embedded regularization</td> <td>Elastic Net</td> <td>ElasticNetCV (sklearn)</td> </tr> <tr> <td>Tree‑based importance</td> <td>Random Forest</td> <td>RandomForestClassifier (Gini importance)</td> </tr> <tr> <td>Tree‑based importance</td> <td>LGBM</td> <td>LGBMClassifier (feature usage / gain)</td> </tr> <tr> <td>Filter</td> <td>Information Gain</td> <td>mutual_info_classif (sklearn)</td> </tr> <tr> <td>Filter</td> <td>ReliefF</td> <td>skrebate library</td> </tr> <tr> <td>Wrapper</td> <td>RFE + SVM</td> <td>Recursive Feature Elimination (linear kernel)</td> </tr> <tr> <td>Wrapper</td> <td>RFE + RF</td> <td>Recursive Feature Elimination (Random Forest)</td> </tr> </tbody> </table> </div> <p class="ds-markdown-paragraph"><strong>LGBM provided the highest‑ranking gene list (used for final peptide generation).</strong></p> <p class="ds-markdown-paragraph"><strong>Data Curation & Quality Control:</strong></p> <ul> <li> <p class="ds-markdown-paragraph"><strong>Source:</strong> GEO (Gene Expression Omnibus) – blood‑derived exosomal transcriptomics</p> </li> <li> <p 
class="ds-markdown-paragraph"><strong>Samples:</strong> 284 PDAC patients + 217 non‑PDAC controls</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Data type:</strong> TPM (transcripts per million) expression values</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Normalization:</strong> Z‑score scaling prior to feature selection</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Probability conversion:</strong> Logistic regression per gene (trained on 401 training samples)</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Amino acid mapping:</strong> 20 probability bins → 20 standard amino acids (A–Y)</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Redundancy check:</strong> None (each sample is a unique patient)</p> </li> <li> <p class="ds-markdown-paragraph"><strong>Train/validation split:</strong> 80/20 stratified (401 train, 100 validation)</p> </li> </ul> <p class="ds-markdown-paragraph"><strong>Usage:</strong><br>These datasets and models are designed for:</p> <ul> <li> <p class="ds-markdown-paragraph"><strong>Novel methodology training</strong> – converting numerical omics data (transcriptomics, proteomics, metabolomics) into sequence representations for LLM classification</p> </li> <li> <p class="ds-markdown-paragraph">PDAC diagnostics using blood‑based, non‑invasive exosomal transcriptomics</p> </li> <li> <p class="ds-markdown-paragraph">Fine‑tuning protein LLMs (ProtBERT, ESM2, PeptideBERT) for cancer classification tasks</p> </li> <li> <p class="ds-markdown-paragraph">Benchmarking LLM‑based vs. 
traditional ML (SVM, RF, LASSO) on transcriptomic data</p> </li> <li> <p class="ds-markdown-paragraph">Identifying minimal gene biomarker panels (as few as 5–10 genes) for PDAC detection</p> </li> <li> <p class="ds-markdown-paragraph">Ensemble methods combining LLM predictions with alignment‑based approaches (MERCI, BLAST)</p> </li> </ul> <p class="ds-markdown-paragraph"><strong>Novel Contribution:</strong><br>To the best of our knowledge, this is the <strong>first study</strong> demonstrating the application of large language models for mining transcriptomic profiles of cancer patients. The reverse strategy (numeric → text → LLM) opens new avenues for applying LLMs to any numerical biomedical data (expression profiles, clinical parameters, imaging features) by converting them into sequence representations.</p> <p class="ds-markdown-paragraph"><strong>Related Resources:</strong></p> <ul> <li> <p class="ds-markdown-paragraph">bioRxiv preprint: <a href="https://doi.org/10.1101/2025.03.06.641795" rel="noopener noreferrer">https://doi.org/10.1101/2025.03.06.641795</a></p> </li> <li> <p class="ds-markdown-paragraph">GitHub (code and models): To be released upon publication</p> </li> <li> <p class="ds-markdown-paragraph">GEO dataset accession: refer to the original Wang et al. (2020) study</p> </li> </ul> <p class="ds-markdown-paragraph"><strong>License:</strong> CC BY‑NC‑ND 4.0</p> <p class="ds-markdown-paragraph"><strong>Contact:</strong><br>Prof. Gajendra P. S. Raghava</p>
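<p class="ds-markdown-paragraph"><strong>Illustrative sketch (not the authors' code):</strong><br>The textual‑representation and pseudo‑peptide generation steps of the Data Processing Workflow can be sketched roughly as follows. This minimal example assumes 20 equal‑width probability bins of 0.05, assigned to the one‑letter amino acid codes in alphabetical order, which is consistent with the stated mapping (0–0.05 → A, 0.05–0.10 → C) and the A–Y range. The function names <code>prob_to_residue</code> and <code>profile_to_peptide</code>, the handling of values on a bin edge, and the example probabilities are hypothetical choices made for illustration only.</p>

```python
# Illustrative sketch only; not the released PDAC-LLM code.
# Assumption: 20 equal-width bins (width 0.05) mapped to the 20 standard
# amino acids in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def prob_to_residue(p: float) -> str:
    """Map one per-gene PDAC probability in [0, 1] to an amino acid letter."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {p}")
    # Bin edges nominally fall into the upper bin (an assumption, and subject
    # to floating-point rounding); p == 1.0 is clamped into the last bin.
    index = min(int(p * len(AMINO_ACIDS)), len(AMINO_ACIDS) - 1)
    return AMINO_ACIDS[index]


def profile_to_peptide(probabilities) -> str:
    """Concatenate one residue per selected gene, in gene-rank order."""
    return "".join(prob_to_residue(p) for p in probabilities)


if __name__ == "__main__":
    # Hypothetical probabilities for a 5-gene panel (k=5).
    example = [0.02, 0.07, 0.50, 0.99, 1.00]
    print(profile_to_peptide(example))
```

<p class="ds-markdown-paragraph">Under these assumptions, a 50‑gene panel yields one 50‑residue pseudo‑peptide per patient, which could then be passed to a protein language model tokenizer in the same way as a natural peptide sequence.</p>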