| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo 2024 |
| Links: | https://doi.org/10.5281/zenodo.16414004 |
Table of contents:
- <h1>PCA_projection</h1>
  <p>Run PCA and project genotypes into PCA space using pre-made SNP loadings.</p>
  <h2>create_pca_projection</h2>
  <p>This workflow creates a PCA projection from a genetic reference dataset (in VCF format). First, the reference data is subset to include only sites in common with a provided reference variant file. This file is intended to contain only variants that one would expect to find in all downstream datasets that will be projected using loadings created in this workflow (e.g., a list of common sites that are easily imputed in TOPMed). The data is then LD-pruned so that the retained variants are in approximate linkage equilibrium. Regions in Table 1 of <a href="https://www.biorxiv.org/content/10.1101/2024.04.02.587682v1">Grinde et al 2024</a> are excluded prior to LD pruning. Related individuals are removed. Then <a href="https://github.com/gabraham/flashpca">FlashPCA</a> is run on the dataset.</p>
  <p>Inputs:</p>
  <table>
  <tr><th>input</th><th>description</th></tr>
  <tr><td>vcf</td><td>Array of VCF files (possibly split by chromosome)</td></tr>
  <tr><td>ref_variants</td><td>File with variants to use in the PCA calculation. The column with variant IDs should be labeled 'ID'.</td></tr>
  <tr><td>n_pcs</td><td>Number of PCs to return</td></tr>
  <tr><td>genome_build</td><td>Genome build for selecting regions to exclude. Allowed values are 38 (default), 37, 36</td></tr>
  <tr><td>prune_variants</td><td>Boolean for whether to do LD pruning on the variants (default true)</td></tr>
  <tr><td>min_maf</td><td>Minimum MAF for variants to include (optional)</td></tr>
  <tr><td>remove_relateds</td><td>Boolean for whether to remove samples with relatedness above max_kinship_coefficient (default true)</td></tr>
  <tr><td>max_kinship_coefficient</td><td>If remove_relateds is true, remove one of each pair of samples with kinship > this value (default 0.0442 = 3rd degree relatives)</td></tr>
  <tr><td>window_size</td><td>Window size for LD pruning (default 10,000)</td></tr>
  <tr><td>shift_size</td><td>Shift size for LD pruning (default 1000)</td></tr>
  <tr><td>r2_threshold</td><td>r2 threshold for LD pruning (default 0.1)</td></tr>
  <tr><td>groups_file</td><td>Two-column tsv file of subject_id and group, used to label plots (optional)</td></tr>
  <tr><td>relatedness_estimator</td><td>If removing related samples, the type of estimator to use when running KING. Either "robust" or "ibdseg" (default "robust")</td></tr>
  </table>
  <p>Outputs:</p>
  <table>
  <tr><th>output</th><th>description</th></tr>
  <tr><td>pcs</td><td>PCs for samples used to create projection</td></tr>
  <tr><td>pc_variance</td><td>Variance explained by each PC</td></tr>
  <tr><td>pc_loadings</td><td>SNP loadings</td></tr>
  <tr><td>mean_sd</td><td>Mean and SD for each variant in SNP loadings file</td></tr>
  <tr><td>eigenvectors</td><td>Eigenvectors</td></tr>
  <tr><td>eigenvalues</td><td>Eigenvalues</td></tr>
  <tr><td>loadings_log</td><td>Log from running plink2 --pca</td></tr>
  <tr><td>pca_projection</td><td>PCs from running PCA on this dataset with calculated loadings</td></tr>
  <tr><td>projection_log</td><td>Log from running plink2 --score</td></tr>
  <tr><td>pca_plots_pc12</td><td>png file of PC1 and PC2 scatterplot</td></tr>
  <tr><td>pca_plot_pairs</td><td>png file of pairwise PC scatterplots</td></tr>
  <tr><td>pca_plots_parcoord</td><td>png file of parallel coordinates plot for PCs</td></tr>
  <tr><td>pca_plots</td><td>html file with PCA plots</td></tr>
  </table>
  <h2>projected_PCA</h2>
  <p>This workflow projects a genetic test dataset (in VCF format) into PCA space using user-defined allele loadings. First, the allele loadings (from the create_pca_projection workflow) and the test dataset are both subset to contain the same set of variants. (Note: this workflow assumes that variants from both the loadings and the test dataset have been previously harmonized so that they follow the same naming convention.) Then the test dataset is projected onto the principal components.</p>
  <p>If you get an error that 0 variants remain in the subsetVariants task, try setting the argument subsetVariants.set_var_ids to 'false'.</p>
  <p>Inputs:</p>
  <table>
  <tr><th>input</th><th>description</th></tr>
  <tr><td>ref_loadings</td><td>File with SNP loadings (e.g. pc_loadings output from create_pca_projection)</td></tr>
  <tr><td>ref_meansd</td><td>File with variant mean and SD (e.g. mean_sd output from create_pca_projection)</td></tr>
  <tr><td>ref_pcs</td><td>PCs from running PCA on reference dataset to create joint plots (optional)</td></tr>
  <tr><td>ref_groups</td><td>Two-column tsv file of subject_id and group from reference dataset, used to label plots (optional)</td></tr>
  <tr><td>groups_file</td><td>Two-column tsv file of subject_id and group from sample dataset, used to label plots (optional)</td></tr>
  <tr><td>vcf</td><td>Array of VCF files (possibly split by chromosome)</td></tr>
  <tr><td>min_overlap</td><td>Minimum overlap between variants in loadings and vcf files (default 0.95). If the overlap is less than this threshold, PCA will not be run and the workflow will exit.</td></tr>
  <tr><td>sample_file</td><td>A file containing the set of sample ids to include (optional)</td></tr>
  <tr><td>variant_file</td><td>A file containing the set of variants to include in the projection. If passed, the intersection between variants in this file and in the reference panel will be used. (optional)</td></tr>
  </table>
  <p>Outputs:</p>
  <table>
  <tr><th>output</th><th>description</th></tr>
  <tr><td>projection_file</td><td>PCs from running PCA on this dataset with ref_loadings</td></tr>
  <tr><td>projection_log</td><td>Log from running plink2 --score</td></tr>
  <tr><td>pca_plots_pc12</td><td>png file of PC1 and PC2 scatterplot of samples</td></tr>
  <tr><td>pca_plot_pairs</td><td>png file of pairwise PC scatterplots of samples</td></tr>
  <tr><td>pca_plots_parcoord</td><td>png file of parallel coordinates plot for PCs of samples</td></tr>
  <tr><td>pca_plots</td><td>html file with PCA plots of samples</td></tr>
  <tr><td>pca_plots_pc12_ref</td><td>png file of PC1 and PC2 scatterplot of samples overlaid on references</td></tr>
  <tr><td>pca_plot_pairs_ref</td><td>png file of pairwise PC scatterplots of samples overlaid on references</td></tr>
  <tr><td>pca_plots_parcoord_ref</td><td>png file of parallel coordinates plot for PCs of samples overlaid on references</td></tr>
  <tr><td>pca_plots_ref</td><td>html file with PCA plots of samples overlaid on references</td></tr>
  </table>
  <h2>LD_pruning</h2>
  <p>This workflow LD-prunes variants, retaining a subset in approximate linkage equilibrium.</p>
  <p>Inputs:</p>
  <table>
  <tr><th>input</th><th>description</th></tr>
  <tr><td>vcf</td><td>Array of VCF files (possibly split by chromosome)</td></tr>
  <tr><td>variant_file</td><td>Optional file with the variant selection from which to start the pruning</td></tr>
  <tr><td>variant_id_col</td><td>Column in variant_file containing the IDs</td></tr>
  <tr><td>min_maf</td><td>Minimum MAF for variants to include (optional)</td></tr>
  <tr><td>snps_only</td><td>Boolean for whether to use only SNPs (default true)</td></tr>
  <tr><td>window_size</td><td>Window size for LD pruning (default 10,000)</td></tr>
  <tr><td>shift_size</td><td>Shift size for LD pruning (default 1000)</td></tr>
  <tr><td>r2_threshold</td><td>r2 threshold for LD pruning (default 0.1)</td></tr>
  </table>
  <p>Outputs:</p>
  <table>
  <tr><th>output</th><th>description</th></tr>
  <tr><td>pruned_vcf</td><td>Array of pruned VCF files</td></tr>
  </table>
  <h2>select_variants_by_pop_maf</h2>
  <p>This workflow selects all variants with MAF > a minimum threshold in any population (i.e. the union of filtering by MAF in each population separately). Samples to select for each population are identified by reading the population_descriptor and sample tables from the specified workspace.</p>
  <p>The output of this workflow is a text file with variant IDs, taken from the ID column of the VCF file(s). Any missing values in the ID column are replaced with chr:pos:ref:alt. Duplicate variant IDs are excluded.</p>
  <p>Inputs:</p>
  <table>
  <tr><th>input</th><th>description</th></tr>
  <tr><td>vcf</td><td>Array of VCF files (possibly split by chromosome)</td></tr>
  <tr><td>min_maf</td><td>Minimum MAF for variants to select</td></tr>
  <tr><td>population_descriptor</td><td>The descriptor to use for identifying populations</td></tr>
  <tr><td>population_labels</td><td>Array of labels for each population. If this input is not supplied, the workflow will use all unique labels for the population descriptor.</td></tr>
  <tr><td>workspace_name</td><td>Name of the workspace with a population_descriptor data table (e.g. "PRIMED_1000G")</td></tr>
  <tr><td>workspace_namespace</td><td>Namespace of the workspace (e.g. "primed-data-cc-1")</td></tr>
  </table>
  <p>Outputs:</p>
  <table>
  <tr><th>output</th><th>description</th></tr>
  <tr><td>maf_filtered_variants</td><td>Text file with variants that passed the MAF filter in any population.</td></tr>
  </table>
  <h2>pca_plots</h2>
  <p>This workflow uses a file with PCs to create pairs plots and parallel coordinate plots.</p>
  <p>Inputs:</p>
  <table>
  <tr><th>input</th><th>description</th></tr>
  <tr><td>data_file</td><td>PCs from running PCA (i.e. pcs from create_pca_projection or projection_file from projected_PCA)</td></tr>
  <tr><td>groups_file</td><td>Two-column tsv file of subject_id and group, used to label plots (optional)</td></tr>
  <tr><td>colormap</td><td>Two-column tsv file of group and color, used to color plots (optional)</td></tr>
  <tr><td>n_pairs</td><td>Number of PCs to use for pairs plots (default 10)</td></tr>
  </table>
  <p>Outputs:</p>
  <table>
  <tr><th>output</th><th>description</th></tr>
  <tr><td>pca_plots_pc12</td><td>png file of PC1 and PC2 scatterplot</td></tr>
  <tr><td>pca_plot_pairs</td><td>png file of pairwise PC scatterplots</td></tr>
  <tr><td>pca_plots_parcoord</td><td>png file of parallel coordinates plot for PCs</td></tr>
  <tr><td>pca_plots</td><td>html file with PCA plots</td></tr>
  </table>
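The projection logic described for projected_PCA (restrict the loadings and the test data to shared variants, stop if the overlap falls below min_overlap, then project) can be illustrated with a minimal pure-Python sketch. This is not the workflow's implementation, which runs plink2 --score. The function names `overlap_fraction` and `project`, the dict-based inputs, and the choice to measure overlap as the fraction of loading variants found in the test data are all assumptions made for illustration. Scaling conventions for projected PCs differ between tools, so these values will not necessarily match plink2 output, and missing genotypes are not handled.

```python
def overlap_fraction(loading_ids, test_ids):
    """Return (fraction of loading variants present in the test data, shared IDs).

    Assumption: overlap is measured relative to the loadings file.
    """
    loading_ids = list(loading_ids)
    present = set(test_ids)
    shared = [v for v in loading_ids if v in present]
    return len(shared) / len(loading_ids), shared


def project(dosages, variant_ids, loadings, mean, sd, min_overlap=0.95):
    """Project samples onto PCs using reference loadings (hypothetical helper).

    dosages:     list of per-sample lists of alt-allele counts
    variant_ids: variant IDs for the columns of `dosages`
    loadings:    dict of variant ID -> list of per-PC loadings
    mean, sd:    dicts of variant ID -> reference mean / SD
    """
    frac, shared = overlap_fraction(loadings, variant_ids)
    if frac < min_overlap:
        # Mirrors the documented behaviour: do not run the projection.
        raise ValueError(f"overlap {frac:.2f} is below min_overlap {min_overlap}")

    col = {v: i for i, v in enumerate(variant_ids)}
    n_pcs = len(next(iter(loadings.values())))
    projected = []
    for row in dosages:
        pcs = [0.0] * n_pcs
        for v in shared:
            # Standardize with the reference mean and SD, then weight by loading.
            z = (row[col[v]] - mean[v]) / sd[v]
            for k in range(n_pcs):
                pcs[k] += z * loadings[v][k]
        projected.append(pcs)
    return projected


# Toy data: two loading variants, three variants in the test set.
loadings = {"rs1": [0.5, 0.1], "rs2": [-0.5, 0.2]}
mean = {"rs1": 1.0, "rs2": 0.5}
sd = {"rs1": 0.5, "rs2": 0.5}
dosages = [[2, 0, 1], [1, 1, 0]]

print(project(dosages, ["rs1", "rs2", "rs3"], loadings, mean, sd))
# -> [[1.5, 0.0], [-0.5, 0.2]]
```

Restricting both inputs to the shared variant set before projecting is what makes harmonized variant IDs a prerequisite: if the naming conventions differ, the shared set is empty and the overlap check fails.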