This repository contains the following files, each tuned for a specific workload and hardware configuration:
├── Exome_2T_PairedSingleSampleWf_optimized.inputs.json → WES Throughput JSON file
├── Exome_56T_PairedSingleSampleWf_optimized.inputs.json → WES Latency JSON file
├── Exome_PairedSingleSampleWf_noqc_nocram_optimized.wdl → WES WDL optimized for on-prem
├── Latency_PairedSingleSampleWf_HT_384GB.json → WGS Latency JSON file with HT on
├── Latency_PairedSingleSampleWf_NO_HT_384GB.json → WGS Latency JSON file with HT off
├── Throughput_PairedSingleSampleWf_HT_384GB.json → WGS Throughput JSON file with HT on
├── Throughput_PairedSingleSampleWf_NO_HT_384GB.json → WGS Throughput JSON file with HT off
├── PairedSingleSampleWf_noqc_nocram_optimized.wdl → WGS WDL optimized for on-prem
├── PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl → WGS WDL optimized for on-prem benchmarking
Modify the following lines in the WDL files to reflect the paths where datasets reside in your cluster:
- PairedSingleSampleWf_noqc_nocram_optimized.wdl
- PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl
- Exome_PairedSingleSampleWf_noqc_nocram_optimized.wdl
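As a rough illustration (the variable names and paths below are hypothetical, not the files' actual declarations), look for the hard-coded filesystem paths in each WDL and point them at your cluster's storage:

```wdl
# Hypothetical declarations for illustration only; edit the corresponding
# hard-coded paths in each WDL file to match your cluster layout.
String tools_dir = "/opt/genomics/tools"
File ref_fasta = "/data/references/hg38/Homo_sapiens_assembly38.fasta"
```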
In the JSON files, modify the paths to the datasets and tools so they point to where these reside in your cluster.
For example, update the tools directory paths in Latency_PairedSingleSampleWf_optimized.inputs.json.
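A minimal sketch of what edited entries might look like (the workflow prefix, key names, and paths here are illustrative; use the keys already present in the JSON file you are editing):

```json
{
  "PairedEndSingleSampleWorkflow.ref_fasta": "/data/references/hg38/Homo_sapiens_assembly38.fasta",
  "PairedEndSingleSampleWorkflow.ref_fasta_index": "/data/references/hg38/Homo_sapiens_assembly38.fasta.fai",
  "PairedEndSingleSampleWorkflow.tools": "/opt/genomics/tools"
}
```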
For improved throughput performance of WGS processing, it is recommended to uncomment the "backend" configuration and set up 4 Cromwell queues. The 4-queue approach supports CPU- and memory-level allocation:
- Local: run the first 3 basic tasks locally and serialize the workflows.
- BWA: run BWA at low priority on all nodes (let BWA run on 1/2 of the nodes until its work is done).
- All: run everything else on 1/2 of the nodes at high priority.
- Haplo: run HaplotypeCaller on 1/2 of the nodes at mid priority.
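As a sketch of the per-task routing (assuming backends named "Local", "BWA", "All", and "Haplo" are defined in your Cromwell configuration; the task below is a made-up example, not one of the workflow's real tasks), uncommenting the `backend` runtime attribute sends a task to the matching queue:

```wdl
# Hypothetical task for illustration; the name, command, and resource
# values are not the workflow's actual settings.
task BwaAlign {
  File input_bam
  command {
    echo "bwa mem would run here on ${input_bam}"
  }
  runtime {
    cpu: 16
    memory: "14 GB"
    backend: "BWA"   # route this task to the low-priority BWA queue
  }
  output {
    File log = stdout()
  }
}
```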
The datasets used for the WGS workflow tuning can be obtained from: https://console.cloud.google.com/storage/browser/broad-public-datasets/NA12878/unmapped/.
Contact Broad/Intel for access to the WES data needed for this workflow.
The other reference files and resource files can be downloaded from:
| Data Type | Filename | File Path |
| --- | --- | --- |
| **Reference Genome** | | |
| ref_dict | Homo_sapiens_assembly38.dict | https://console.cloud.google.com/storage/browser/broad-references/hg38/v0 |
| ref_fasta | Homo_sapiens_assembly38.fasta | |
| ref_fasta_index | Homo_sapiens_assembly38.fasta.fai | |
| ref_alt | Homo_sapiens_assembly38.fasta.64.alt | |
| ref_sa | Homo_sapiens_assembly38.fasta.64.sa | |
| ref_amb | Homo_sapiens_assembly38.fasta.64.amb | |
| ref_bwt | Homo_sapiens_assembly38.fasta.64.bwt | |
| ref_ann | Homo_sapiens_assembly38.fasta.64.ann | |
| ref_pac | Homo_sapiens_assembly38.fasta.64.pac | |
| contamination_sites_ud | Homo_sapiens_assembly38.contam.UD | |
| contamination_sites_bed | Homo_sapiens_assembly38.contam.bed | |
| contamination_sites_mu | Homo_sapiens_assembly38.contam.mu | |
| **Resource Files** | | |
| dbSNP_vcf | Homo_sapiens_assembly38.dbsnp138.vcf | |
| dbSNP_vcf_index | Homo_sapiens_assembly38.dbsnp138.vcf.idx | |
| known_snps_sites_vcf | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | |
| known_snps_sites_vcf_index | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi | |
| known_indels_sites_VCFs | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz<br>Homo_sapiens_assembly38.known_indels.vcf.gz | |
| known_indels_sites_indices | Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi<br>Homo_sapiens_assembly38.known_indels.vcf.gz.tbi | |
| **Interval Files** | | |
| wgs_calling_interval_list | wgs_calling_regions.hg38.interval_list *SEE NOTE BELOW | |
| wgs_coverage_interval_list | wgs_coverage_regions.hg38.interval_list | |
| wgs_evaluation_interval_list | wgs_evaluation_regions.hg38.interval_list | |
| **Small Test Input Datasets** | | |
| flowcell_unmapped_bams | H06HDADXX130110.1.ATCACGAT.20k_reads.bam<br>H06HDADXX130110.2.ATCACGAT.20k_reads.bam<br>H06JUADXX130110.1.ATCACGAT.20k_reads.bam | |
NOTE: The Exome Interval file whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list
is hosted at https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/.
For on-prem, the workflow uses non-dockerized tools:
GATK can be downloaded from: https://github.com/broadinstitute/gatk/releases
SAMtools can be downloaded from: http://www.htslib.org/download/
Picard can be downloaded from: https://broadinstitute.github.io/picard/