Bibliographic Details
Main Author: Farookhi, Heba
Format: Digital resource
Language: English
Published: Zenodo 2025
Online Access: https://doi.org/10.5281/zenodo.14638949
Table of Contents:
<h1>H5N1 Wastewater Detection Demo Dataset</h1>
<h2>Overview</h2>
<p>This dataset combines simulated H5N1 influenza reads with real wastewater metagenome data to create a benchmark for viral detection methods. It simulates a scenario where a novel H5N1 strain is present in urban wastewater at detectable levels.</p>
<h2>Dataset Composition</h2>
<ul>
<li>Total reads: 707,830
<ul>
<li>H5N1 reads: 1,120 (0.16%)</li>
<li>Wastewater reads: 706,710 (99.84%)</li>
</ul>
</li>
</ul>
<h3>Viral Content Breakdown</h3>
<pre>Total Viruses (0.34% of all reads):
├── H5N1 (0.16%)
├── Caudovirales (0.18%)
│   ├── Siphoviridae (0.12%)
│   └── Podoviridae (0.06%)
├── Microviridae (0.08%)
└── Other viruses (0.02%)</pre>
<h2>Data Sources</h2>
<h3>H5N1 Component</h3>
<ul>
<li>Source: Influenza A virus (A/chicken/Egypt/N19604C/2021(H9N2))</li>
<li>NCBI Accessions:
<ul>
<li>PB2: ON374267.1</li>
<li>PB1: ON374268.1</li>
<li>PA: ON374269.1</li>
<li>HA: ON374270.1 (Modified with mutations)</li>
<li>NP: ON374271.1</li>
<li>NA: ON374272.1</li>
<li>M: ON374273.1</li>
<li>NS: ON374266.1</li>
</ul>
</li>
</ul>
<h4>Modifications</h4>
<ul>
<li>Mutation rate: 0.1% (introduced using wgsim)</li>
<li>Error rate: 0.1%</li>
<li>Coverage: 10x</li>
<li>Read length: 150bp</li>
<li>Sequencing profile: HiSeq 2500</li>
</ul>
<h3>Wastewater Component</h3>
<ul>
<li>Source: Global Urban Virome Project</li>
<li>Accession: ERR2734409</li>
<li>Original composition preserved</li>
<li>Represents typical urban wastewater viral diversity</li>
</ul>
<h2>Directory Structure</h2>
<pre>combined_data/
├── input/
│   ├── fasta/
│   │   ├── h5n1.fasta
│   │   └── wastewater.fasta
│   ├── h5n1/
│   │   ├── final_reads.fq
│   │   └── h5n1_complete.fasta
│   ├── uncompressed/
│   │   └── ERR2734409.fastq
│   └── wastewater/
│       └── ERR2734409.fastq.gz
├── scripts/
│   ├── combine_segments.py
│   ├── convert_fastq.py
│   └── workflow.sh
└── README.md</pre>
<h2>Dataset Creation Method</h2>
<h3>1. H5N1 Sequence Preparation</h3>
<pre><code># Introduce mutations in HA segment
wgsim -N 100000 -e 0.001 -r 0.001 -R 0.0 ON374266.1.fasta HA_mutated_1.fq

# Convert mutated sequence to FASTA
python3 convert_fastq.py

# Combine all segments
python3 combine_segments.py</code></pre>
<h3>2. Read Simulation</h3>
<pre><code># Generate Illumina reads using ART
art_illumina -i h5n1_complete.fasta -l 150 -ss HS25 -f 10 -sam -nf 0 -o final_reads</code></pre>
<h3>3. Dataset Combination</h3>
<pre><code># Combine H5N1 and wastewater reads
cat final_reads.fq ERR2734409.fastq > combined_reads.fastq</code></pre>
<h2>Validation Results</h2>
<pre><code>Dataset Statistics:
==================
Total reads: 707,830
H5N1: 1,120 reads (0.16%)
Wastewater: 706,710 reads (99.84%)

Read Lengths:
Mean: 150.0
Std dev: 0.0

GC Content:
Mean: 47.2%
Std dev: 8.4%</code></pre>
<h2>Citations</h2>
<ol>
<li>Nieuwenhuijse, D.F., Oude Munnink, B.B., Phan, M.V.T. et al. Setting a baseline for global urban virome surveillance in sewage. Sci Rep 10, 13748 (2020). <a href="https://doi.org/10.1038/s41598-020-69869-0">https://doi.org/10.1038/s41598-020-69869-0</a></li>
<li>Li H. wgsim - Read simulator for next generation sequencing. (2011).</li>
<li>Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012 Feb 15;28(4):593-4.</li>
</ol>
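<p>The figures listed under Validation Results (read counts per source, read-length mean and standard deviation, GC-content mean and standard deviation) could be recomputed from the combined FASTQ with a short script. The following is a minimal sketch, not the author's validation code. It assumes four-line FASTQ records, that wastewater read names begin with the run accession <code>ERR2734409</code> while every other read is a simulated one, and that the population standard deviation is the one intended. The function names <code>read_fastq</code>, <code>gc_percent</code>, and <code>summarize</code> are hypothetical.</p>

```python
import io
import statistics


def read_fastq(handle):
    """Yield (name, sequence) pairs from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:           # end of file
            return
        if not header.strip():   # tolerate stray blank lines
            continue
        seq = handle.readline().strip()
        handle.readline()        # '+' separator line
        handle.readline()        # quality string
        yield header.strip()[1:], seq


def gc_percent(seq):
    """Return the percentage of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def summarize(handle, wastewater_prefix="ERR2734409"):
    """Collect per-source read counts plus length and GC statistics.

    Assumption: reads whose name starts with wastewater_prefix come from
    the wastewater run; all remaining reads are counted as simulated.
    """
    counts = {"wastewater": 0, "simulated": 0}
    lengths, gcs = [], []
    for name, seq in read_fastq(handle):
        source = "wastewater" if name.startswith(wastewater_prefix) else "simulated"
        counts[source] += 1
        lengths.append(len(seq))
        gcs.append(gc_percent(seq))
    total = len(lengths)
    return {
        "total": total,
        "counts": counts,
        "simulated_pct": 100.0 * counts["simulated"] / total if total else 0.0,
        "mean_length": statistics.mean(lengths) if lengths else 0.0,
        "sd_length": statistics.pstdev(lengths) if lengths else 0.0,
        "mean_gc": statistics.mean(gcs) if gcs else 0.0,
        "sd_gc": statistics.pstdev(gcs) if gcs else 0.0,
    }


if __name__ == "__main__":
    # Tiny in-memory example; the read names here are made up.
    demo = (
        "@ERR2734409.1 example\nACGT\n+\nIIII\n"
        "@simulated-read-1\nGGCC\n+\nIIII\n"
    )
    print(summarize(io.StringIO(demo)))
```

<p>For the real file, <code>summarize(open("combined_reads.fastq"))</code> would be the equivalent call. The read-name rule should be checked against the actual headers in <code>final_reads.fq</code> before the percentages are trusted.</p>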