
Intel Optimized GATK4 Germline SNPs and Indels Variant Calling Workflow.

WORKFLOWS AND JSONS

This repository contains several workflow and input files, each tuned for specific requirements.

├── Exome_2T_PairedSingleSampleWf_optimized.inputs.json         WES Throughput JSON file
├── Exome_56T_PairedSingleSampleWf_optimized.inputs.json        WES Latency JSON file
├── Exome_PairedSingleSampleWf_noqc_nocram_optimized.wdl        WES WDL optimized for on-prem
├── Latency_PairedSingleSampleWf_HT_384GB.json                  WGS Latency JSON file with HT on
├── Latency_PairedSingleSampleWf_NO_HT_384GB.json               WGS Latency JSON file with HT off
├── Throughput_PairedSingleSampleWf_HT_384GB.json               WGS Throughput JSON file with HT on
├── Throughput_PairedSingleSampleWf_NO_HT_384GB.json            WGS Throughput JSON file with HT off
├── PairedSingleSampleWf_noqc_nocram_optimized.wdl              WGS WDL optimized for on-prem
└── PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl  WGS WDL optimized for on-prem benchmarking
Modify the paths in the WDL files to reflect where the datasets reside in your cluster.

In the JSON files, modify the paths to the datasets and tools so that they match the locations in your cluster.
Example: modify Latency_PairedSingleSampleWf_optimized.inputs.json to point to your tools directory.
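As an illustration of the kind of edit involved, a few entries of an inputs JSON might look like the sketch below. The workflow prefix and key names here are hypothetical; use the keys that actually appear in the JSON file shipped in this repository and substitute your own cluster paths:

```json
{
  "PairedSingleSampleWf.ref_fasta": "/cluster/datasets/hg38/Homo_sapiens_assembly38.fasta",
  "PairedSingleSampleWf.dbSNP_vcf": "/cluster/datasets/hg38/Homo_sapiens_assembly38.dbsnp138.vcf",
  "PairedSingleSampleWf.gatk_path": "/cluster/tools/gatk/gatk",
  "PairedSingleSampleWf.samtools_path": "/cluster/tools/samtools/samtools"
}
```

The edited inputs file is then passed to Cromwell together with the WDL, e.g. `java -jar cromwell.jar run <workflow.wdl> -i <inputs.json>`.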

For improved throughput of WGS processing, it is recommended to uncomment the "backend" configuration and set up four Cromwell queues, with CPU- and memory-level allocation:

- Local: run the first three basic tasks locally and serialize the workflows.
- BWA: run BWA at low priority on all nodes (let BWA run on half of the nodes until its work is done).
- All: half of the nodes for everything else, at high priority.
- Haplo: half of the "All" nodes at mid priority for HaplotypeCaller.
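A sketch of what such a multi-queue setup can look like in Cromwell's HOCON backend configuration is shown below. The queue names, priorities, and submit strings are illustrative assumptions for an SGE-style scheduler; adapt them to your cluster and to the "backend" stanza already present in this repository's Cromwell configuration:

```hocon
# Sketch only: provider names, queue names, and submit commands are
# illustrative; match them to your scheduler and site configuration.
backend {
  default = "All"
  providers {
    BWA {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
        Int cpu = 1
        String queue = "bwa.q"
        """
        submit = "qsub -q ${queue} -pe smp ${cpu} ${script}"
      }
    }
    # "All" and "Haplo" follow the same pattern with their own queues and
    # priorities; "Local" uses Cromwell's built-in local provider for the
    # first three serialized tasks.
  }
}
```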

DATASETS

The datasets used for WGS workflow tuning can be obtained from: https://console.cloud.google.com/storage/browser/broad-public-datasets/NA12878/unmapped/.

Contact Broad/Intel for access to the WES data needed for this workflow.

The other reference files and resource files can be downloaded from:

Datasets recommended for setup and testing of this workflow:

| Data Type | Input | Filename |
| --- | --- | --- |
| Reference Genome | ref_dict | Homo_sapiens_assembly38.dict |
| | ref_fasta | Homo_sapiens_assembly38.fasta |
| | ref_fasta_index | Homo_sapiens_assembly38.fasta.fai |
| | ref_alt | Homo_sapiens_assembly38.fasta.64.alt |
| | ref_sa | Homo_sapiens_assembly38.fasta.64.sa |
| | ref_amb | Homo_sapiens_assembly38.fasta.64.amb |
| | ref_bwt | Homo_sapiens_assembly38.fasta.64.bwt |
| | ref_ann | Homo_sapiens_assembly38.fasta.64.ann |
| | ref_pac | Homo_sapiens_assembly38.fasta.64.pac |
| | contamination_sites_ud | Homo_sapiens_assembly38.contam.UD |
| | contamination_sites_bed | Homo_sapiens_assembly38.contam.bed |
| | contamination_sites_mu | Homo_sapiens_assembly38.contam.mu |
| Resource Files | dbSNP_vcf | Homo_sapiens_assembly38.dbsnp138.vcf |
| | dbSNP_vcf_index | Homo_sapiens_assembly38.dbsnp138.vcf.idx |
| | known_snps_sites_vcf | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz |
| | known_snps_sites_vcf_index | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi |
| | known_indels_sites_VCFs | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz, Homo_sapiens_assembly38.known_indels.vcf.gz |
| | known_indels_sites_indices | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi, Homo_sapiens_assembly38.known_indels.vcf.gz.tbi |
| Interval Files | wgs_calling_interval_list | wgs_calling_regions.hg38.interval_list *SEE NOTE BELOW |
| | wgs_coverage_interval_list | wgs_coverage_regions.hg38.interval_list |
| | wgs_evaluation_interval_list | wgs_evaluation_regions.hg38.interval_list |
| Small Test Input Datasets | flowcell_unmapped_bams | H06HDADXX130110.1.ATCACGAT.20k_reads.bam, H06HDADXX130110.2.ATCACGAT.20k_reads.bam, H06JUADXX130110.1.ATCACGAT.20k_reads.bam |

The reference, resource, and interval files can be downloaded from https://console.cloud.google.com/storage/browser/broad-references/hg38/v0. The small test input BAMs can be downloaded from https://console.cloud.google.com/storage/browser/genomics-public-data/test-data/dna/wgs/hiseq2500/NA12878/.

NOTE: The Exome Interval file whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list is hosted at https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/.

TOOLS

For on-prem, the workflow uses non-dockerized tools:

GATK can be downloaded from: https://github.com/broadinstitute/gatk/releases
SAMtools can be downloaded from: http://www.htslib.org/download/
Picard can be downloaded from: https://broadinstitute.github.io/picard/