| Main Author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2025 |
| Online Access: | https://doi.org/10.5281/zenodo.17581962 |
Table of Contents:
- <p><strong>2,313 curated strain-specific genome-scale metabolic models (GEMs) for <em>Escherichia coli</em></strong>, plus supporting data for pangenome-scale metabolic reconstruction and analysis.</p> <p> <strong>Interactive GEM browser:</strong> <a href="https://omidard.github.io/EcopanGEM/">https://omidard.github.io/EcopanGEM/</a><br> <strong>Code repository:</strong> <a href="https://github.com/omidard/EcopanGEM">https://github.com/omidard/EcopanGEM</a><br> <strong>Data setup:</strong> <code>cd data && make fetch-data</code> (see repository README) </p> <h3>File Descriptions</h3> <table> <tbody><tr><th>File</th><th>Size</th><th>Description</th></tr> </tbody><tbody> <tr> <td><strong>Ecoli_GEMs_for_Complete_genomes.zip</strong></td> <td>314 MB</td> <td>2,313 curated strain-specific genome-scale metabolic models (GEMs) in COBRApy JSON format. Each file is named by its NCBI assembly accession or BV-BRC genome ID (e.g., <code>GCF_003966425.1.json.json</code>, <code>562.72354.json.json</code>). Models contain reactions, metabolites, genes, GPR rules, and flux bounds. Browse, search, and download individual models at <a href="https://omidard.github.io/EcopanGEM/">omidard.github.io/EcopanGEM</a>.</td> </tr> <tr> <td><strong>marlbr2.mat</strong></td> <td>42 MB</td> <td>Reference genome-scale metabolic model (iML1515 derivative) for <em>E. coli</em> K-12 MG1655 in MATLAB format. Used as the template for orthology-based GEM drafting (<code>pangem.py</code>) and essential gap-filling (<code>ecoli_gapfilling6.py</code>).</td> </tr> <tr> <td><strong>universal_model.json</strong></td> <td>21 MB</td> <td>Universal metabolic model (iB21_1397) in COBRApy JSON format. Used during the curation step (<code>Eco_panGEM_curation.py</code>) to extract special reactions (ENLIPIDAt2ex, THZPSN, ICHORS, GLYCTO1). 
Referenced as <code>iB21_1397.json</code> in the pipeline scripts.</td> </tr> <tr> <td><strong>biolog.csv</strong></td> <td>1.4 MB</td> <td>Biolog PM1/PM2 carbon-source growth phenotypes for <em>E. coli</em> strains. Columns: Strain, Media, Met_Id (metabolite identifiers joined by semicolons), Growth (binary). Used as ground truth for FBA growth prediction validation (<code>biolog_ecoli_prediction.py</code>). Referenced as <code>biolog_panGEM.csv</code> in the pipeline scripts.</td> </tr> <tr> <td><strong>ecoli_gprs.csv.zip</strong></td> <td>22 MB</td> <td>Gene–protein–reaction (GPR) allele matrix (reactions × GEMs). Each cell contains the allele identifiers for that reaction in that strain. Produced by <code>eco_gems_allels.py</code>. Compressed CSV.</td> </tr> <tr> <td><strong>pangenome.csv.zip</strong></td> <td>39 MB</td> <td>Pangenome presence/absence matrix at 80% CD-HIT protein identity threshold. Rows = protein clusters, columns = genomes, values = binary (1/0). Compressed CSV.</td> </tr> <tr> <td><strong>pangenome_s.zip</strong></td> <td>1.6 GB</td> <td>CD-HIT sensitivity analysis outputs across 7 identity thresholds (65%, 70%, 75%, 80%, 85%, 90%, 95%). Contains per-threshold files: <code>cluster_to_locus_*.json</code> (cluster membership maps) and <code>presence_absence_matrix_*.csv</code> (presence/absence matrices). Used by <code>pangenome_sensitivity_results.py</code> to generate cluster-count and CAR (Core/Accessory/Rare) sensitivity figures.</td> </tr> <tr> <td><strong>locustags_genes_mapping.pkl</strong></td> <td>2.4 GB</td> <td>Python pickle (pandas-compatible) dictionary mapping locus tags to gene names across all 2,377 complete <em>E. coli</em> genomes. 
Used by gene neighborhood analysis (<code>genes_neighborhood_analysis_total_preparation.py</code>) and GPR allele export (<code>eco_gems_allels.py</code>).</td> </tr> <tr> <td><strong>header_to_allele.pickle.zip</strong></td> <td>665 MB</td> <td>Python pickle dictionary mapping FASTA sequence headers to CD-HIT allele cluster identifiers. Used by <code>eco_gems_allels.py</code> to map locus tags to allele identifiers in the GPR matrix. Compressed.</td> </tr> <tr> <td><strong>phylon_locustags_df.csv.zip</strong></td> <td>45 MB</td> <td>Locus-tag-level phylogroup assignments. Columns include locus_tag, genome_id, and phylogroup. Used for phylogenetic stratification of metabolic gene content. Compressed CSV.</td> </tr> <tr> <td><strong>curated_metadata_mash_filtered.pickle</strong></td> <td>8.4 MB</td> <td>Curated metadata for 15,259 <em>E. coli</em> genomes as a pandas DataFrame (Python pickle). 58 columns including: assembly_accession, genome_id, strain, phylogroup, MLST, MASH cluster, isolation_source, isolation_country, host_name, disease, genome_length, gc_content, sequencing_platform, and more. Quality-filtered using MASH-based clustering.</td> </tr> <tr> <td><strong>all_reactions_gene_neighborhood.csv.zip</strong></td> <td>241 MB</td> <td>Gene neighborhood analysis results for all metabolic reactions. Each row links a locus tag to its associated reaction, genome, gene name, pangenome category, and the products of the 2 upstream and 2 downstream neighboring genes. Produced by <code>genes_neighborhood_analysis_total2.py</code>. Compressed CSV.</td> </tr> <tr> <td><strong>Unique_ModelSEED_Reaction_Aliases.txt</strong></td> <td>7.8 MB</td> <td>ModelSEED reaction identifier to alternative database ID mappings (KEGG, MetaCyc, BiGG, etc.). 
Tab-separated reference lookup table used during reaction curation to cross-reference reaction identifiers across databases.</td> </tr> <tr> <td><strong>Scripts.zip</strong></td> <td>55 KB</td> <td>Original pipeline scripts at time of initial submission. <strong>For the latest version of all scripts, use the GitHub repository:</strong> <a href="https://github.com/omidard/EcopanGEM">github.com/omidard/EcopanGEM</a>.</td> </tr> <tr> <td><strong>Notebook_panGEM_panGPR.ipynb</strong></td> <td>3 MB</td> <td>Jupyter notebook for panGEM/panGPR analysis and visualization (knockout simulations, fitness analysis, Biolog validation). <strong>For the latest version, use the GitHub repository:</strong> <a href="https://github.com/omidard/EcopanGEM">github.com/omidard/EcopanGEM</a>.</td> </tr> </tbody> </table> <h3>Quick Start</h3> <ol> <li>Clone the repository: <code>git clone https://github.com/omidard/EcopanGEM.git</code></li> <li>Download data: <code>cd EcopanGEM/data && make fetch-data</code></li> <li>Create conda environment: <code>conda env create -f scripts/environment.yml</code></li> <li>See the <a href="https://github.com/omidard/EcopanGEM#end-to-end-pipeline-required-order">README</a> for the full pipeline execution order.</li> </ol>
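<p>As a quick sanity check after downloading, a model file can be inspected with nothing but the Python standard library. The sketch below assumes the usual COBRApy JSON layout (top-level <code>reactions</code>, <code>metabolites</code>, and <code>genes</code> lists, with each reaction carrying a <code>metabolites</code> mapping and a <code>gene_reaction_rule</code> string). The helper names <code>summarize_gem</code> and <code>summarize_gem_file</code>, and the toy model, are illustrative and not part of the EcopanGEM pipeline. To actually simulate a model, loading it with COBRApy (for example <code>cobra.io.load_json_model</code>) is the more likely route.</p>

```python
import json


def summarize_gem(model: dict) -> dict:
    """Count the main components of a model parsed from COBRApy-style JSON."""
    reactions = model.get("reactions", [])
    # Reactions with a non-empty GPR rule are gene-associated.
    with_gpr = [r for r in reactions if r.get("gene_reaction_rule", "").strip()]
    # Heuristic (assumption): a reaction with exactly one metabolite is a
    # boundary reaction (exchange, demand, or sink).
    boundary = [r for r in reactions if len(r.get("metabolites", {})) == 1]
    return {
        "id": model.get("id"),
        "n_reactions": len(reactions),
        "n_metabolites": len(model.get("metabolites", [])),
        "n_genes": len(model.get("genes", [])),
        "n_reactions_with_gpr": len(with_gpr),
        "n_boundary_reactions": len(boundary),
    }


def summarize_gem_file(path) -> dict:
    """Read one model JSON file from disk and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_gem(json.load(handle))


# Tiny made-up model in the same layout, used only to exercise the helper.
toy_model = {
    "id": "toy",
    "metabolites": [
        {"id": "glc__D_e", "compartment": "e"},
        {"id": "g6p_c", "compartment": "c"},
    ],
    "genes": [{"id": "b1101"}, {"id": "b2415"}],
    "reactions": [
        {
            "id": "EX_glc__D_e",
            "metabolites": {"glc__D_e": -1.0},
            "lower_bound": -10.0,
            "upper_bound": 1000.0,
            "gene_reaction_rule": "",
        },
        {
            "id": "GLCpts",
            "metabolites": {"glc__D_e": -1.0, "g6p_c": 1.0},
            "lower_bound": 0.0,
            "upper_bound": 1000.0,
            "gene_reaction_rule": "b1101 and b2415",
        },
    ],
}

toy_summary = summarize_gem(toy_model)
print(toy_summary)
```

<p>For a real file, call <code>summarize_gem_file</code> with the path to one extracted model from <code>Ecoli_GEMs_for_Complete_genomes.zip</code>; the exact path depends on where the archive was unpacked.</p>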