Bibliographic Details
Main Author: broadinstitute
Format: Digital resource
Published: Zenodo 2021
Online Access: https://doi.org/10.5281/zenodo.16412356
Table of Contents:
  • <h1>GATK-SV</h1> <p>A structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data.</p> <h2>Table of Contents</h2> <ul> <li><a href="#requirements">Requirements</a></li> <li><a href="#citation">Citation</a></li> <li><a href="#quickstart">Quickstart</a></li> <li><a href="#overview">Pipeline Overview</a> <ul> <li><a href="#cohort-mode">Cohort mode</a></li> <li><a href="#single-sample-mode">Single-sample mode</a></li> <li><a href="#gcnv-training-overview">gCNV model</a></li> </ul> </li> <li><a href="#descriptions">Module Descriptions</a> <ul> <li><a href="#module00a">Module 00a</a> - Raw callers and evidence collection</li> <li><a href="#module00b">Module 00b</a> - Batch QC</li> <li><a href="#gcnv-training">gCNV training</a> - gCNV model creation</li> <li><a href="#module00c">Module 00c</a> - Batch evidence merging, BAF generation, and depth callers</li> <li><a href="#module01">Module 01</a> - Site clustering</li> <li><a href="#module02">Module 02</a> - Site metrics</li> <li><a href="#module03">Module 03</a> - Filtering</li> <li><a href="#gather-vcfs">Gather Cohort VCFs</a> - Cross-batch site merging</li> <li><a href="#module04">Module 04</a> - Genotyping</li> <li><a href="#module04b">Module 04b</a> - Genotype refinement (optional)</li> <li><a href="#module0506">Module 05/06</a> - Cross-batch integration, complex event resolution, and VCF cleanup</li> <li><a href="#module07">Module 07</a> - Downstream Filtering</li> <li><a href="#module08">Module 08</a> - Annotation</li> <li><a href="#module09">Module 09</a> - QC and Visualization</li> <li>Additional modules - Mosaic and de novo</li> </ul> </li> <li><a href="#troubleshooting">Troubleshooting</a></li> </ul> <h2><a name="requirements">Requirements</a></h2> <h3>Deployment and execution:</h3> <ul> <li>A <a href="https://cloud.google.com/">Google Cloud</a> account.</li> <li>A workflow execution system supporting the <a href="https://openwdl.org/">Workflow Description Language</a> 
(WDL), either: <ul> <li><a href="https://github.com/broadinstitute/cromwell">Cromwell</a> (v36 or higher). A dedicated server is highly recommended.</li> <li>or <a href="https://terra.bio/">Terra</a> (note preconfigured GATK-SV workflows are not yet available for this platform)</li> </ul> </li> <li>Recommended: <a href="https://melt.igs.umaryland.edu/">MELT</a>. Due to licensing restrictions, we cannot provide a public docker image or reference panel VCFs for this algorithm.</li> <li>Recommended: <a href="https://github.com/broadinstitute/cromshell">cromshell</a> for interacting with a dedicated Cromwell server.</li> <li>Recommended: <a href="https://cromwell.readthedocs.io/en/stable/WOMtool/">WOMtool</a> for validating WDL/json files.</li> </ul> <h3>Data:</h3> <ul> <li>Illumina short-read whole-genome CRAMs or BAMs, aligned to hg38 with <a href="https://github.com/lh3/bwa">bwa-mem</a>. BAMs must also be indexed.</li> <li>Indexed GVCFs produced by GATK HaplotypeCaller, or a jointly genotyped VCF.</li> <li>Family structure definitions file in <a href="https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format">PED format</a>. 
Sex aneuploidies (detected in <a href="#module00b">Module 00b</a>) should be entered as sex = 0.</li> </ul> <h4><a name="sampleids">Sample ID requirements:</a></h4> <p>Sample IDs must:</p> <ul> <li>Be unique within the cohort</li> <li>Contain only alphanumeric characters and underscores (no dashes, whitespace, or special characters)</li> </ul> <p>Sample IDs should not:</p> <ul> <li>Contain only numeric characters</li> <li>Be a substring of another sample ID in the same cohort</li> <li>Contain any of the following substrings: <code>chr</code>, <code>name</code>, <code>DEL</code>, <code>DUP</code>, <code>CPX</code>, <code>CHROM</code></li> </ul> <p>The same requirements apply to family IDs in the PED file, as well as batch IDs and the cohort ID provided as workflow inputs.</p> <p>Sample IDs are provided to <a href="#module00a">Module00a</a> directly and need not match sample names from the BAM/CRAM headers or GVCFs. <code>GetSampleID.wdl</code> can be used to fetch BAM sample IDs and also generates a set of alternate IDs that are considered safe for this pipeline; alternatively, <a href="https://github.com/talkowski-lab/gnomad_sv_v3/blob/master/sample_id/convert_sample_ids.py">this script</a> transforms a list of sample IDs to fit these requirements. Currently, sample IDs can be replaced again in <a href="#module00c">Module 00c</a>.</p> <p>The following inputs will need to be updated with the transformed sample IDs:</p> <ul> <li>Sample ID list for <a href="#module00a">Module00a</a> or <a href="#module00c">Module 00c</a></li> <li>PED file</li> </ul> <p>If using a SNP VCF in <a href="#module00c">Module 00c</a>, it does not need to be re-headered; simply provide the <code>vcf_samples</code> argument.</p> <h2><a name="citation">Citation</a></h2> <p>Please cite the following publication: <a href="https://doi.org/10.1038/s41586-020-2287-8">Collins, Brand, et al. 2020. "A structural variation reference for medical and population genetics." 
Nature 581, 444-451.</a></p> <p>Additional references: <a href="http://dx.doi.org/10.1038/s41588-018-0107-y">Werling et al. 2018. "An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder." Nature genetics 50.5, 727-736.</a></p> <h2><a name="quickstart">Quickstart</a></h2> <h4>WDLs</h4> <p>There are two scripts for running the full pipeline:</p> <ul> <li><code>wdl/GATKSVPipelineBatch.wdl</code>: Runs GATK-SV on a batch of samples.</li> <li><code>wdl/GATKSVPipelineSingleSample.wdl</code>: Runs GATK-SV on a single sample, given a reference panel</li> </ul> <h4>Inputs</h4> <p>Example workflow inputs can be found in <code>/inputs</code>. All required resources are available in public Google buckets.</p> <h4>MELT</h4> <p><strong>Important</strong>: The example input files contain MELT inputs that are NOT public (see <a href="#requirements">Requirements</a>). These include:</p> <ul> <li><code>GATKSVPipelineSingleSample.melt_docker</code> and <code>GATKSVPipelineBatch.melt_docker</code> - MELT docker URI (see <a href="https://github.com/talkowski-lab/gatk-sv-v1/blob/master/dockerfiles/README.md">Docker readme</a>)</li> <li><code>GATKSVPipelineSingleSample.ref_std_melt_vcfs</code> - Standardized MELT VCFs (<a href="#module00c">Module00c</a>)</li> </ul> <p>The input values are provided only as an example and are not publicly accessible. In order to include MELT, these values must be provided by the user. 
MELT can be disabled by deleting these inputs and setting <code>GATKSVPipelineBatch.use_melt</code> to <code>false</code>.</p> <h4>Requester pays buckets</h4> <p><strong>Important</strong>: The following parameters must be set when certain input data is in requester pays (RP) buckets:</p> <ul> <li><code>GATKSVPipelineSingleSample.requester_pays_cram</code> and <code>GATKSVPipelineBatch.Module00aBatch.requester_pays_crams</code> - set to <code>True</code> if inputs are CRAM format and in an RP bucket, otherwise <code>False</code>.</li> <li><code>GATKSVPipelineBatch.GATKSVPipelinePhase1.gcs_project_for_requester_pays</code> - set to your Google Cloud Project ID if gVCFs are in an RP bucket, otherwise omit this parameter.</li> </ul> <h4>Execution</h4> <p>We recommend running the pipeline on a dedicated <a href="https://github.com/broadinstitute/cromwell">Cromwell</a> server with a <a href="https://github.com/broadinstitute/cromshell">cromshell</a> client. A batch run can be started with the following commands:</p> <pre><code>> mkdir gatksv_run && cd gatksv_run
> mkdir wdl && cd wdl
> cp $GATK_SV_V1_ROOT/wdl/*.wdl .
> zip dep.zip *.wdl
> cd ..
> cp $GATK_SV_V1_ROOT/inputs/GATKSVPipelineBatch.ref_panel_1kg.json GATKSVPipelineBatch.my_run.json
> cromshell submit wdl/GATKSVPipelineBatch.wdl GATKSVPipelineBatch.my_run.json cromwell_config.json wdl/dep.zip
</code></pre> <p>where <code>cromwell_config.json</code> is a Cromwell <a href="https://cromwell.readthedocs.io/en/stable/wf_options/Overview/">workflow options file</a>. Note users will need to re-populate batch/sample-specific parameters (e.g. 
BAMs and sample IDs).</p> <h2><a name="overview">Pipeline Overview</a></h2> <p>The pipeline consists of a series of modules that perform the following:</p> <ul> <li><a href="#module00a">Module 00a</a>: SV evidence collection, including calls from a configurable set of algorithms (Delly, Manta, MELT, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).</li> <li><a href="#module00b">Module 00b</a>: Dosage bias scoring and ploidy estimation</li> <li><a href="#module00c">Module 00c</a>: Copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation</li> <li><a href="#module01">Module 01</a>: Variant clustering</li> <li><a href="#module02">Module 02</a>: Variant filtering metric generation</li> <li><a href="#module03">Module 03</a>: Variant filtering; outlier exclusion</li> <li><a href="#module04">Module 04</a>: Genotyping</li> <li><a href="#module0506">Module 05/06</a>: Cross-batch integration; complex variant resolution and re-genotyping; VCF cleanup</li> <li><a href="#module07">Module 07</a>: Downstream filtering, including minGQ, batch effect check, outlier sample removal, and final recalibration</li> <li><a href="#module08">Module 08</a>: Annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets</li> <li><a href="#module09">Module 09</a>: Visualization, including scripts that generate IGV screenshots and RD plots</li> <li>Additional modules to be added: de novo and mosaic scripts</li> </ul> <p>Repository structure:</p> <ul> <li><code>/inputs</code>: Example workflow parameter files for running gCNV training, GATK-SV batch mode, and GATK-SV single-sample mode</li> <li><code>/dockerfiles</code>: Resources for building pipeline docker images (see <a href="https://github.com/talkowski-lab/gatk-sv-v1/blob/master/dockerfiles/README.md">readme</a>)</li> <li><code>/wdl</code>: WDLs running the 
pipeline. There is a master WDL for running each module, e.g. <code>Module01.wdl</code>.</li> <li><code>/scripts</code>: scripts for running tests, building dockers, and analyzing cromwell metadata files</li> <li><code>/src</code>: main pipeline scripts <ul> <li><code>/RdTest</code>: scripts for depth testing</li> <li><code>/sv-pipeline</code>: various scripts and packages used throughout the pipeline</li> <li><code>/svqc</code>: Python module for checking that pipeline metrics fall within acceptable limits</li> <li><code>/svtest</code>: Python module for generating various summary metrics from module outputs</li> <li><code>/svtk</code>: Python module of tools for SV-related datafile parsing and analysis</li> <li><code>/WGD</code>: whole-genome dosage scoring scripts</li> </ul> </li> <li><code>/test</code>: WDL test parameter files. Please note that file inputs may not be publicly available.</li> </ul> <h2><a name="cohort-mode">Cohort mode</a></h2> <p>A minimum cohort size of 100 with roughly equal number of males and females is recommended. For modest cohorts (~100-500 samples), the pipeline can be run as a single batch using <code>GATKSVPipelineBatch.wdl</code>.</p> <p>For larger cohorts, samples should be split up into batches of ~100-500 samples. 
We recommend batching based on overall coverage and dosage score (WGD), which can be generated in <a href="#module00b">Module 00b</a>.</p> <p>The pipeline should be executed as follows:</p> <ul> <li>Modules <a href="#module00a">00a</a> and <a href="#module00b">00b</a> can be run on arbitrary cohort partitions</li> <li>Modules <a href="#module00c">00c</a>, <a href="#module01">01</a>, <a href="#module02">02</a>, and <a href="#module03">03</a> are run separately per batch</li> <li><a href="#module04">Module 04</a> is run separately per batch, using filtered variants (<a href="#module03">Module 03</a> output) combined across all batches</li> <li><a href="#module0506">Module 05/06</a> and beyond are run on all batches together</li> </ul> <p>Note: <a href="#module00c">Module 00c</a> requires a <a href="#gcnv-training">trained gCNV model</a>.</p> <h2><a name="single-sample-mode">Single-sample mode</a></h2> <p><code>GATKSVPipelineSingleSample.wdl</code> runs the pipeline on a single sample using a fixed reference panel. 
An example reference panel containing 156 samples from the <a href="https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019">NYGC 1000G Terra workspace</a> is provided with <code>inputs/GATKSVPipelineSingleSample.ref_panel_1kg.na12878.json</code>.</p> <p>Custom reference panels can be generated by running <code>GATKSVPipelineBatch.wdl</code> and <code>trainGCNV.wdl</code> and using the outputs to replace the following single-sample workflow inputs:</p> <ul> <li><code>GATKSVPipelineSingleSample.ref_ped_file</code> : <code>batch.ped</code> - Manually created (see <a href="#requirements">data requirements</a>)</li> <li><code>GATKSVPipelineSingleSample.contig_ploidy_model_tar</code> : <code>batch-contig-ploidy-model.tar.gz</code> - gCNV contig ploidy model (<a href="#gcnv-training">gCNV training</a>)</li> <li><code>GATKSVPipelineSingleSample.gcnv_model_tars</code> : <code>batch-model-files-*.tar.gz</code> - gCNV model tarballs (<a href="#gcnv-training">gCNV training</a>)</li> <li><code>GATKSVPipelineSingleSample.ref_pesr_disc_files</code> - <code>sample.disc.txt.gz</code> - Paired-end evidence files (<a href="#module00a">Module 00a</a>)</li> <li><code>GATKSVPipelineSingleSample.ref_pesr_split_files</code> - <code>sample.split.txt.gz</code> - Split read evidence files (<a href="#module00a">Module 00a</a>)</li> <li><code>GATKSVPipelineSingleSample.ref_panel_bincov_matrix</code>: <code>batch.RD.txt.gz</code> - Read counts matrix (<a href="#module00c">Module 00c</a>)</li> <li><code>GATKSVPipelineSingleSample.ref_panel_del_bed</code> : <code>batch.DEL.bed.gz</code> - Depth deletion calls (<a href="#module00c">Module 00c</a>)</li> <li><code>GATKSVPipelineSingleSample.ref_panel_dup_bed</code> : <code>batch.DUP.bed.gz</code> - Depth duplication calls (<a href="#module00c">Module 00c</a>)</li> <li><code>GATKSVPipelineSingleSample.ref_samples</code> - Reference panel sample IDs</li> <li><code>GATKSVPipelineSingleSample.ref_std_manta_vcfs</code> - 
<code>std_XXX.manta.sample.vcf.gz</code> - Standardized Manta VCFs (<a href="#module00c">Module 00c</a>)</li> <li><code>GATKSVPipelineSingleSample.ref_std_melt_vcfs</code> - <code>std_XXX.melt.sample.vcf.gz</code> - Standardized Melt VCFs (<a href="#module00c">Module 00c</a>)</li> <li><code>GATKSVPipelineSingleSample.ref_std_wham_vcfs</code> - <code>std_XXX.wham.sample.vcf.gz</code> - Standardized Wham VCFs (<a href="#module00c">Module 00c</a>)</li> <li><code>GATKSVPipelineSingleSample.cutoffs</code> : <code>batch.cutoffs</code> - Filtering cutoffs (<a href="#module03">Module 03</a>)</li> <li><code>GATKSVPipelineSingleSample.genotype_pesr_pesr_sepcutoff</code> : <code>genotype_pesr.pesr_sepcutoff.txt</code> - Genotyping cutoffs (<a href="#module04">Module 04</a>)</li> <li><code>GATKSVPipelineSingleSample.genotype_pesr_depth_sepcutoff</code> : <code>genotype_pesr.depth_sepcutoff.txt</code> - Genotyping cutoffs (<a href="#module04">Module 04</a>)</li> <li><code>GATKSVPipelineSingleSample.genotype_depth_pesr_sepcutoff</code> : <code>genotype_depth.pesr_sepcutoff.txt</code> - Genotyping cutoffs (<a href="#module04">Module 04</a>)</li> <li><code>GATKSVPipelineSingleSample.genotype_depth_depth_sepcutoff</code> : <code>genotype_depth.depth_sepcutoff.txt</code> - Genotyping cutoffs (<a href="#module04">Module 04</a>)</li> <li><code>GATKSVPipelineSingleSample.PE_metrics</code> : <code>pe_metric_file.txt</code> - Paired-end evidence genotyping metrics (<a href="#module04">Module 04</a>)</li> <li><code>GATKSVPipelineSingleSample.SR_metrics</code> : <code>sr_metric_file.txt</code> - Split read evidence genotyping metrics (<a href="#module04">Module 04</a>)</li> <li><code>GATKSVPipelineSingleSample.ref_panel_vcf</code> : <code>batch.cleaned.vcf.gz</code> - Final output VCF (<a href="#module0506">Module 05/06</a>)</li> </ul> <h2><a name="gcnv-training-overview">gCNV Training</a></h2> <p>Both the cohort and single-sample modes use the GATK gCNV depth calling pipeline, which 
requires a <a href="#gcnv-training">trained model</a> as input. The samples used for training should be technically homogeneous and similar to the samples to be processed (e.g. same sample type, library prep protocol, sequencer, and sequencing center). The samples to be processed may comprise all or a subset of the training set. For small cohorts, a single gCNV model is usually sufficient. If a cohort contains multiple data sources, we recommend clustering them using the dosage score, and training a separate model for each cluster.</p> <h2><a name="descriptions">Module Descriptions</a></h2> <p>The following sections briefly describe each module and highlight inter-dependent input/output files. Note that input/output mappings can also be gleaned from <code>GATKSVPipelineBatch.wdl</code>, and example input files for each module can be found in <code>/test</code>.</p> <h2><a name="module00a">Module 00a</a></h2> <p>Runs raw evidence collection on each sample.</p> <p>Note: a list of sample IDs must be provided. Refer to the <a href="#sampleids">sample ID requirements</a> for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.</p> <h4>Inputs:</h4> <ul> <li>Per-sample BAM or CRAM files aligned to hg38. Index files (<code>.bai</code>) must be provided if using BAMs.</li> </ul> <h4>Outputs:</h4> <ul> <li>Caller VCFs (Delly, Manta, MELT, and/or Wham)</li> <li>Binned read counts file</li> <li>Split reads (SR) file</li> <li>Discordant read pairs (PE) file</li> <li>B-allele fraction (BAF) file</li> </ul> <h2><a name="module00b">Module 00b</a></h2> <p>Runs ploidy estimation, dosage scoring, and optionally VCF QC. 
The results from this module can be used for QC and batching.</p> <p>For large cohorts, we recommend dividing samples into smaller batches (~500 samples) with a ~1:1 male:female ratio.</p> <p>We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file.</p> <h4>Prerequisites:</h4> <ul> <li><a href="#module00a">Module 00a</a></li> </ul> <h4>Inputs:</h4> <ul> <li>Read count files (<a href="#module00a">Module 00a</a>)</li> <li>(Optional) SV call VCFs (<a href="#module00a">Module 00a</a>)</li> </ul> <h4>Outputs:</h4> <ul> <li>Per-sample dosage scores with plots</li> <li>Ploidy estimates and sex assignments, with plots</li> <li>(Optional) Outlier samples detected by call counts</li> </ul> <h2><a name="gcnv-training">gCNV Training</a></h2> <p>Trains a gCNV model for use in <a href="#module00c">Module 00c</a>. The WDL can be found at <code>/gcnv/trainGCNV.wdl</code>.</p> <h4>Prerequisites:</h4> <ul> <li><a href="#module00a">Module 00a</a></li> <li>(Recommended) <a href="#module00b">Module 00b</a></li> </ul> <h4>Inputs:</h4> <ul> <li>Read count files (<a href="#module00a">Module 00a</a>)</li> </ul> <h4>Outputs:</h4> <ul> <li>Contig ploidy model tarball</li> <li>gCNV model tarballs</li> </ul> <h2><a name="module00c">Module 00c</a></h2> <p>Runs CNV callers (cn.MOPS, GATK gCNV) and combines single-sample raw evidence into a batch. See <a href="#cohort-mode">above</a> for more information on batching.</p> <h4>Prerequisites:</h4> <ul> <li><a href="#module00a">Module 00a</a></li> <li>(Recommended) <a href="#module00b">Module 00b</a></li> <li>gCNV training</li> </ul> <h4>Inputs:</h4> <ul> <li>PED file (updated with <a href="#module00b">Module 00b</a> sex assignments, including sex = 0 for sex aneuploidies. 
Calls will not be made on sex chromosomes when sex = 0 in order to avoid generating many confusing calls or upsetting normalized copy numbers for the batch.)</li> <li>Per-sample GVCFs generated with HaplotypeCaller (<code>gvcfs</code> input), or a jointly-genotyped VCF (position-sharded, <code>snp_vcfs</code> input or <code>snp_vcfs_shard_list</code> input)</li> <li>Read count, BAF, PE, and SR files (<a href="#module00a">Module 00a</a>)</li> <li>Caller VCFs (<a href="#module00a">Module 00a</a>)</li> <li>Contig ploidy model and gCNV model files (gCNV training)</li> </ul> <h4>Outputs:</h4> <ul> <li>Combined read count matrix, SR, PE, and BAF files</li> <li>Standardized call VCFs</li> <li>Depth-only (DEL/DUP) calls</li> <li>Per-sample median coverage estimates</li> <li>(Optional) Evidence QC plots</li> </ul> <h2><a name="module01">Module 01</a></h2> <p>Clusters SV calls across a batch.</p> <h4>Prerequisites:</h4> <ul> <li><a href="#module00c">Module 00c</a></li> </ul> <h4>Inputs:</h4> <ul> <li>Standardized call VCFs (<a href="#module00c">Module 00c</a>)</li> <li>Depth-only (DEL/DUP) calls (<a href="#module00c">Module 00c</a>)</li> </ul> <h4>Outputs:</h4> <ul> <li>Clustered SV VCFs</li> <li>Clustered depth-only call VCF</li> </ul> <h2><a name="module02">Module 02</a></h2> <p>Generates variant metrics for filtering.</p> <h4>Prerequisites:</h4> <ul> <li><a href="#module01">Module 01</a></li> </ul> <h4>Inputs:</h4> <ul> <li>Combined read count matrix, SR, PE, and BAF files (<a href="#module00c">Module 00c</a>)</li> <li>Per-sample median coverage estimates (<a href="#module00c">Module 00c</a>)</li> <li>Clustered SV VCFs (<a href="#module01">Module 01</a>)</li> <li>Clustered depth-only call VCF (<a href="#module01">Module 01</a>)</li> </ul> <h4>Outputs:</h4> <ul> <li>Metrics file</li> </ul> <h2><a name="module03">Module 03</a></h2> <p>Filters poor-quality variants and outlier samples.</p> <h4>Prerequisites:</h4> <ul> <li><a href="#module02">Module 02</a></li> </ul> 
<h4>Inputs:</h4> <ul> <li>Batch PED file</li> <li>Metrics file (<a href="#module02">Module 02</a>)</li> <li>Clustered SV and depth-only call VCFs (<a href="#module01">Module 01</a>)</li> </ul> <h4>Outputs:</h4> <ul> <li>Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded</li> <li>Filtered depth-only call VCF with outlier samples excluded</li> <li>Random forest cutoffs file</li> <li>PED file with outlier samples excluded</li> </ul> <h2><a name="gather-vcfs">Merge Cohort VCFs</a></h2> <p>Combines filtered variants across batches. The WDL can be found at: <code>/wdl/MergeCohortVcfs.wdl</code>.</p> <h4>Prerequisites:</h4> <ul> <li><a href="#module03">Module 03</a></li> </ul> <h4>Inputs:</h4> <ul> <li>List of filtered PESR VCFs (<a href="#module03">Module 03</a>)</li> <li>List of filtered depth VCFs (<a href="#module03">Module 03</a>)</li> </ul> <h4>Outputs:</h4> <ul> <li>Combined cohort PESR and depth VCFs</li> <li>Cohort and clustered depth variant BED files</li> </ul> <h2><a name="module04">Module 04</a></h2> <p>Genotypes a batch of samples across filtered variants combined across all batches.</p> <h4>Prerequisites:</h4> <ul> <li><a href="#module03">Module 03</a></li> <li>Merge Cohort VCFs</li> </ul> <h4>Inputs:</h4> <ul> <li>Batch PESR and depth VCFs (<a href="#module03">Module 03</a>)</li> <li>Cohort PESR and depth VCFs (Merge Cohort VCFs)</li> <li>Batch read count, PE, and SR files (<a href="#module00c">Module 00c</a>)</li> </ul> <h4>Outputs:</h4> <ul> <li>Filtered SV (non-depth-only a.k.a. 
"PESR") VCF with outlier samples excluded</li> <li>Filtered depth-only call VCF with outlier samples excluded</li> <li>PED file with outlier samples excluded</li> <li>List of SR pass variants</li> <li>List of SR fail variants</li> <li>(Optional) Depth re-genotyping intervals list</li> </ul> <h2><a name="module04b">Module 04b</a></h2> <p>Re-genotypes probable mosaic variants across multiple batches.</p> <h4>Prerequisites:</h4> <ul> <li><a href="#module04">Module 04</a></li> </ul> <h4>Inputs:</h4> <ul> <li>Per-sample median coverage estimates (<a href="#module00c">Module 00c</a>)</li> <li>Pre-genotyping depth VCFs (<a href="#module03">Module 03</a>)</li> <li>Batch PED files (<a href="#module03">Module 03</a>)</li> <li>Clustered depth variant BED file (Merge Cohort VCFs)</li> <li>Cohort depth VCF (Merge Cohort VCFs)</li> <li>Genotyped depth VCFs (<a href="#module04">Module 04</a>)</li> <li>Genotyped depth RD cutoffs file (<a href="#module04">Module 04</a>)</li> </ul> <h4>Outputs:</h4> <ul> <li>Re-genotyped depth VCFs</li> </ul> <h2><a name="module0506">Module 05/06</a></h2> <p>Combines variants across multiple batches, resolves complex variants, re-genotypes, and performs final VCF clean-up.</p> <h4>Prerequisites:</h4> <ul> <li><a href="#module04">Module 04</a></li> <li>(Optional) <a href="#module04b">Module 04b</a></li> </ul> <h4>Inputs:</h4> <ul> <li>RD, PE and SR file URIs (<a href="#module00c">Module 00c</a>)</li> <li>Batch filtered PED file URIs (<a href="#module03">Module 03</a>)</li> <li>Genotyped PESR VCF URIs (<a href="#module04">Module 04</a>)</li> <li>Genotyped depth VCF URIs (<a href="#module04">Module 04</a> or <a href="#module04b">04b</a>)</li> <li>SR pass variant file URIs (<a href="#module04">Module 04</a>)</li> <li>SR fail variant file URIs (<a href="#module04">Module 04</a>)</li> <li>Genotyping cutoff file URIs (<a href="#module04">Module 04</a>)</li> <li>Batch IDs</li> <li>Sample ID list URIs</li> </ul> <h4>Outputs:</h4> <ul> <li>Finalized "cleaned" 
VCF and QC plots</li> </ul> <h2><a name="module07">Module 07</a> (in development)</h2> <p>Applies downstream filtering steps to the cleaned VCF to further control the false discovery rate; all steps are optional, and users should choose which to run based on the goals of their projects.</p> <p>Filtering methods include:</p> <ul> <li>minGQ - remove variants based on the genotype quality across populations. Note: Trio families are required to build the minGQ filtering model in this step. We provide tables pre-trained with the 1000 Genomes samples at different FDR thresholds for projects that lack family structures, and they can be found here:</li> </ul> <pre><code>gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.10perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.5perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
</code></pre> <ul> <li>BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches</li> <li>FilterOutlierSamples - remove outlier samples with unusually high or low numbers of SVs</li> <li>FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation</li> </ul> <h2><a name="module08">Module 08</a> (in development)</h2> <p>Adds annotations, such as the inferred function and allele frequencies of variants, to the final VCF.</p> <p>Annotation methods include:</p> <ul> <li>Functional annotation - annotate SVs with inferred function on protein-coding regions, regulatory regions such as UTRs and promoters, and other non-coding elements;</li> <li>Allele Frequency annotation - annotate SVs with their allele frequencies across all samples, and samples of specific sex, as well as 
specific sub-populations.</li> <li>Allele Frequency annotation with external callset - annotate SVs with the allele frequencies of their overlapping SVs in another callset, e.g. the gnomAD SV callset.</li> </ul> <h2><a name="module09">Module 09</a> (in development)</h2> <p>Visualizes SVs with <a href="http://software.broadinstitute.org/software/igv/">IGV</a> screenshots and read depth plots.</p> <p>Visualization methods include:</p> <ul> <li>RD Visualization - generate RD plots across all samples, ideal for visualizing large CNVs.</li> <li>IGV Visualization - generate IGV plots of each SV for individual samples, ideal for visualizing de novo small SVs.</li> <li>Module09.visualize.wdl - generate RD plots and IGV plots, and combine them for easy review.</li> </ul> <h2><a name="troubleshooting">Troubleshooting</a></h2> <h3>VM runs out of memory or disk</h3> <ul> <li>Default pipeline settings are tuned for batches of 100 samples. Larger batches or cohorts may require additional VM resources. Most runtime attributes can be modified through the <code>RuntimeAttr</code> inputs. These are formatted like this in the JSON:</li> </ul> <pre><code>"ModuleX.runtime_attr_override": {
  "disk_gb": 100,
  "mem_gb": 16
},
</code></pre> <p>Note that a subset of the struct attributes can be specified. See <code>wdl/Structs.wdl</code> for available attributes.</p>
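<p>As an illustration of the override format above, the following Python sketch adds a partial <code>RuntimeAttr</code> override to an inputs dictionary before it is written out as JSON. The helper <code>with_runtime_override</code> and the <code>ModuleX.batch</code> entry are hypothetical names used only for this example; <code>ModuleX.runtime_attr_override</code>, <code>disk_gb</code>, and <code>mem_gb</code> are taken from the snippet above, and any other attribute name should be checked against <code>wdl/Structs.wdl</code> before use.</p>

```python
import json


def with_runtime_override(inputs, key, **attrs):
    """Return a copy of `inputs` with `key` set to a partial RuntimeAttr override.

    Hypothetical helper: it only edits a dictionary and does not check
    attribute names against wdl/Structs.wdl.
    """
    if not attrs:
        raise ValueError("at least one runtime attribute is required")
    updated = dict(inputs)
    # Merge with any override already present so earlier settings are kept.
    merged = dict(updated.get(key, {}))
    merged.update(attrs)
    updated[key] = merged
    return updated


# "ModuleX.batch" is a made-up placeholder for an existing workflow input.
inputs = {"ModuleX.batch": "my_batch"}
inputs = with_runtime_override(
    inputs, "ModuleX.runtime_attr_override", disk_gb=100, mem_gb=16
)
print(json.dumps(inputs, indent=2))
```

<p>Because only the listed attributes are written, the override stays a subset of the struct, so unspecified attributes are left for the workflow to fill in. Calling the helper again with the same key updates individual attributes without discarding the others.</p>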