This repository contains several files, each tuned for a particular requirement:
├── 2T_PairedSingleSampleWf_optimized.inputs.json → Throughput JSON file
├── 56T_PairedSingleSampleWf_optimized.inputs.20k.json → 20k test data JSON file
├── 56T_PairedSingleSampleWf_optimized.inputs.json → Latency JSON file
├── HDD_2T_PairedSingleSampleWf_optimized.inputs.json → Throughput JSON file (for HDD)
├── HDD_56T_PairedSingleSampleWf_optimized.inputs.json → Latency JSON file (for HDD)
├── PairedSingleSampleWf_noqc_nocram_optimized.wdl → WDL optimized for on-prem
└── PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl → WDL optimized for on-prem with cleanup of output results (for throughput analysis)
In the PairedSingleSampleWf_noqc_nocram_optimized.wdl file, edit line 1270 so that it points to the path where the datasets reside on your cluster.
In the PairedSingleSampleWf_noqc_nocram_withcleanup_optimized.wdl file, edit line 1317 so that it points to the path where the datasets reside on your cluster.
In the JSON files, update the dataset and tool paths to match their locations on your cluster.
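Updating many paths by hand is error-prone, so the following is a minimal Python sketch of one way to rewrite a common path prefix across an inputs JSON file. It assumes that the paths in the file share a single prefix that can be swapped for your cluster's location; the function names `rewrite_paths` and `rewrite_inputs_file` and the prefixes in the usage note are hypothetical and not part of this repository.

```python
import json


def rewrite_paths(value, old_prefix, new_prefix):
    """Recursively replace old_prefix with new_prefix at the start of every string."""
    if isinstance(value, str):
        if value.startswith(old_prefix):
            return new_prefix + value[len(old_prefix):]
        return value
    if isinstance(value, list):
        return [rewrite_paths(v, old_prefix, new_prefix) for v in value]
    if isinstance(value, dict):
        return {k: rewrite_paths(v, old_prefix, new_prefix) for k, v in value.items()}
    # Numbers, booleans and null values are returned unchanged.
    return value


def rewrite_inputs_file(src, dst, old_prefix, new_prefix):
    """Load an inputs JSON file, rewrite its path prefixes, and write the result to dst."""
    with open(src) as fh:
        inputs = json.load(fh)
    updated = rewrite_paths(inputs, old_prefix, new_prefix)
    with open(dst, "w") as fh:
        json.dump(updated, fh, indent=2)
        fh.write("\n")
    return updated
```

For example, `rewrite_inputs_file("56T_PairedSingleSampleWf_optimized.inputs.json", "my.inputs.json", "/old/prefix", "/path/to/shared_filesystem")` would write a modified copy and leave the original file untouched; `/old/prefix` stands in for whatever prefix the shipped JSON actually uses. Review the resulting file before running the workflow, since only values that start with the old prefix are changed.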
Assuming the environment has been set up to offload the PairHMM kernel of HaplotypeCaller to an FPGA, the following changes must be made in the WDL and JSON files (as indicated by the comments in those files) to make use of the FPGA.
a. In the WDL file, in the runtime section of the HaplotypeCaller task, uncomment the line: require_fpga: "yes"
b. In the JSON file, change the value of "PairedEndSingleSampleWorkflow.gatk_gkl_pairhmm_implementation" from "VECTOR_LOGLESS_CACHING" to "VECTOR_LOGLESS_CACHING_FPGA_EXPERIMENTAL".
Refer to: WDL and JSON examples.
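The JSON change in step b can also be scripted. The sketch below is one possible way to do it in Python, using the key and the two values named above; the helper names `enable_fpga_pairhmm` and `enable_fpga_in_file` are hypothetical, and the sketch assumes the inputs file is a flat JSON object with that key at the top level. It does not touch the WDL file, so step a still has to be done by hand.

```python
import json

PAIRHMM_KEY = "PairedEndSingleSampleWorkflow.gatk_gkl_pairhmm_implementation"
CPU_VALUE = "VECTOR_LOGLESS_CACHING"
FPGA_VALUE = "VECTOR_LOGLESS_CACHING_FPGA_EXPERIMENTAL"


def enable_fpga_pairhmm(inputs):
    """Return a copy of the inputs mapping with the PairHMM implementation set to the FPGA value."""
    current = inputs.get(PAIRHMM_KEY)
    # Refuse to guess if the key is missing or holds an unexpected value.
    if current not in (CPU_VALUE, FPGA_VALUE):
        raise ValueError(f"unexpected value for {PAIRHMM_KEY}: {current!r}")
    updated = dict(inputs)
    updated[PAIRHMM_KEY] = FPGA_VALUE
    return updated


def enable_fpga_in_file(path):
    """Apply enable_fpga_pairhmm to a JSON inputs file in place."""
    with open(path) as fh:
        inputs = json.load(fh)
    updated = enable_fpga_pairhmm(inputs)
    with open(path, "w") as fh:
        json.dump(updated, fh, indent=2)
        fh.write("\n")
```

The function raises an error instead of silently adding the key, so a mistyped or missing setting is noticed before the workflow is launched.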
Contact Intel/Broad for access to the WGS data needed for this workflow.
For on-prem, the workflow uses non-dockerized tools. To match the exact versions released by Broad for their best practices workflow, we copy the tools from the Docker image to our shared file system.
Run the command:
docker run -v /path/to/shared_filesystem:/path/to/shared_filesystem -it broadinstitute/genomes-in-the-cloud:2.3.1-1504795437 /bin/bash
This command pulls the Docker image (if it is not already available locally) and opens a shell inside the container, from which you can copy the tools needed for the workflow.
root@54754360159e:/usr/gitc# cp -r /usr/local/bin/samtools gatk4 bwa picard.jar /path/to/shared_filesystem
root@54754360159e:/usr/gitc# exit
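After leaving the container, it can be worth confirming that the copy succeeded. The following Python sketch checks the shared file system for the four items named in the `cp` command above; the function name `missing_tools` is hypothetical, and the check only tests that each path exists, not that the tool runs.

```python
import os

# Items copied out of the container by the cp command above.
TOOLS = ("samtools", "gatk4", "bwa", "picard.jar")


def missing_tools(shared_dir, names=TOOLS):
    """Return the names from `names` that are not present in shared_dir."""
    return [n for n in names if not os.path.exists(os.path.join(shared_dir, n))]
```

Calling `missing_tools("/path/to/shared_filesystem")` returns an empty list when everything is in place, and otherwise lists what still needs to be copied.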
In addition to the tools above, this workflow uses the latest GATK 3.8-1 jar, which includes optimizations and can be obtained from the GATK archive website.
Lastly, the hybrid workflow also needs a tool called "VerifyBamID", which can be downloaded and built as follows:
- If on CentOS, first install curl-devel:
yum install curl-devel
cd /path/to/shared_filesystem
wget https://github.com/Griffan/VerifyBamID/archive/c8a66425c312e5f8be46ab0c41f8d7a1942b6e16.zip && \
unzip c8a66425c312e5f8be46ab0c41f8d7a1942b6e16.zip && \
cd VerifyBamID-c8a66425c312e5f8be46ab0c41f8d7a1942b6e16 && \
mkdir build && \
cd build && \
CC=$(which gcc-4.9) CXX=$(which g++-4.9) cmake .. && \
make && \
make test && \
cd ../../ && \
mv VerifyBamID-c8a66425c312e5f8be46ab0c41f8d7a1942b6e16/bin/VerifyBamID . && \
rm -rf c8a66425c312e5f8be46ab0c41f8d7a1942b6e16.zip VerifyBamID-c8a66425c312e5f8be46ab0c41f8d7a1942b6e16