
Intel Optimized GATK4 Germline SNPs and Indels Variant Calling Workflow.

WORKFLOWS AND JSONS

This repository contains several workflow and input files, each tuned for specific requirements.

├── Exome_2T_PairedSingleSampleWf_optimized.inputs.json         WES Throughput JSON file
├── Exome_56T_PairedSingleSampleWf_optimized.inputs.json        WES Latency JSON file
├── Exome_PairedSingleSampleWf_noqc_nocram_optimized.wdl        WES WDL optimized for on-prem
├── Latency_PairedSingleSampleWf_HT_384GB.json                  WGS Latency JSON file with HT on
├── Latency_PairedSingleSampleWf_NO_HT_384GB.json               WGS Latency JSON file with HT off
├── Throughput_PairedSingleSampleWf_HT_384GB.json               WGS Throughput JSON file with HT on
├── Throughput_PairedSingleSampleWf_NO_HT_384GB.json            WGS Throughput JSON file with HT off
├── PairedSingleSampleWf_noqc_nocram_optimized.wdl              WGS WDL optimized for on-prem
└── PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl  WGS WDL optimized for on-prem benchmarking
Modify the paths in the WDL files to reflect where the datasets reside in your cluster.

In the JSON files, modify the paths to the datasets and tools so that they match the locations in your cluster.
Example: modify Latency_PairedSingleSampleWf_optimized.inputs.json to point to your tools directory.
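As an illustration of the kind of edit involved, a few entries of an inputs JSON might look like the sketch below. The workflow prefix and key names here are hypothetical; use the keys that actually appear in the JSON file shipped in this repository and substitute your own cluster paths:

```json
{
  "PairedSingleSampleWf.ref_fasta": "/cluster/datasets/hg38/Homo_sapiens_assembly38.fasta",
  "PairedSingleSampleWf.dbSNP_vcf": "/cluster/datasets/hg38/Homo_sapiens_assembly38.dbsnp138.vcf",
  "PairedSingleSampleWf.gatk_path": "/cluster/tools/gatk/gatk",
  "PairedSingleSampleWf.samtools_path": "/cluster/tools/samtools/samtools"
}
```

The edited inputs file is then passed to Cromwell together with the WDL, e.g. `java -jar cromwell.jar run <workflow.wdl> -i <inputs.json>`.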

For improved throughput of WGS processing, it is recommended to uncomment the "backend" configuration and set up four Cromwell queues, with CPU- and memory-level allocation:

- Local: run the first three basic tasks locally and serialize the workflows.
- BWA: run BWA at low priority on all nodes (let BWA run on half of the nodes until its work is done).
- All: half of the nodes for everything else, at high priority.
- Haplo: half of the "All" nodes at mid priority for HaplotypeCaller.
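A sketch of what such a multi-queue setup can look like in Cromwell's HOCON backend configuration is shown below. The queue names, priorities, and submit strings are illustrative assumptions for an SGE-style scheduler; adapt them to your cluster and to the "backend" stanza already present in this repository's Cromwell configuration:

```hocon
# Sketch only: provider names, queue names, and submit commands are
# illustrative; match them to your scheduler and site configuration.
backend {
  default = "All"
  providers {
    BWA {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
        Int cpu = 1
        String queue = "bwa.q"
        """
        submit = "qsub -q ${queue} -pe smp ${cpu} ${script}"
      }
    }
    # "All" and "Haplo" follow the same pattern with their own queues and
    # priorities; "Local" uses Cromwell's built-in local provider for the
    # first three serialized tasks.
  }
}
```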

DATASETS

The datasets used for WGS workflow tuning can be obtained from: https://console.cloud.google.com/storage/browser/broad-public-datasets/NA12878/unmapped/.

Contact Broad/Intel for access to the WES data needed for this workflow.

The other reference files and resource files can be downloaded from:

Datasets recommended for setup and testing of this workflow:

| Data Type | Input | Filename |
| --- | --- | --- |
| Reference Genome | ref_dict | Homo_sapiens_assembly38.dict |
| | ref_fasta | Homo_sapiens_assembly38.fasta |
| | ref_fasta_index | Homo_sapiens_assembly38.fasta.fai |
| | ref_alt | Homo_sapiens_assembly38.fasta.64.alt |
| | ref_sa | Homo_sapiens_assembly38.fasta.64.sa |
| | ref_amb | Homo_sapiens_assembly38.fasta.64.amb |
| | ref_bwt | Homo_sapiens_assembly38.fasta.64.bwt |
| | ref_ann | Homo_sapiens_assembly38.fasta.64.ann |
| | ref_pac | Homo_sapiens_assembly38.fasta.64.pac |
| | contamination_sites_ud | Homo_sapiens_assembly38.contam.UD |
| | contamination_sites_bed | Homo_sapiens_assembly38.contam.bed |
| | contamination_sites_mu | Homo_sapiens_assembly38.contam.mu |
| Resource Files | dbSNP_vcf | Homo_sapiens_assembly38.dbsnp138.vcf |
| | dbSNP_vcf_index | Homo_sapiens_assembly38.dbsnp138.vcf.idx |
| | known_snps_sites_vcf | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz |
| | known_snps_sites_vcf_index | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi |
| | known_indels_sites_VCFs | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz, Homo_sapiens_assembly38.known_indels.vcf.gz |
| | known_indels_sites_indices | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi, Homo_sapiens_assembly38.known_indels.vcf.gz.tbi |
| Interval Files | wgs_calling_interval_list | wgs_calling_regions.hg38.interval_list *SEE NOTE BELOW |
| | wgs_coverage_interval_list | wgs_coverage_regions.hg38.interval_list |
| | wgs_evaluation_interval_list | wgs_evaluation_regions.hg38.interval_list |
| Small Test Input Datasets | flowcell_unmapped_bams | H06HDADXX130110.1.ATCACGAT.20k_reads.bam, H06HDADXX130110.2.ATCACGAT.20k_reads.bam, H06JUADXX130110.1.ATCACGAT.20k_reads.bam |

The reference, resource, and interval files can be downloaded from https://console.cloud.google.com/storage/browser/broad-references/hg38/v0. The small test input BAMs can be downloaded from https://console.cloud.google.com/storage/browser/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/.

NOTE: The Exome Interval file whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list is hosted at https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/.

TOOLS

For on-prem, the workflow uses non-dockerized tools:

GATK can be downloaded from: https://github.com/broadinstitute/gatk/releases
SAMtools can be downloaded from: http://www.htslib.org/download/
Picard can be downloaded from: https://broadinstitute.github.io/picard/