To run the variant calling pipeline on its own, provide the following required inputs in configs/config-variant_calling.yaml:
- FASTQ files containing DNA sequencing reads for each sample. Specify the location of these files in a tab-delimited text file containing three columns:
unique_sample_name | dna_fastq_path_1 | dna_fastq_path_2
where each row is a different sample.
- A reference genome for DNA-seq alignment
- A BWA index of the aforementioned reference genome
- If you choose to use GATK's base quality score recalibration (BQSR) and variant quality score recalibration (VQSR), you must provide the following, as described in this GATK article:
  - True sites training resource: HapMap
  - True sites training resource: Omni
  - Non-true sites training resource: 1000G
  - Known sites resource, not used in training: dbSNP

  These resources can usually be obtained from the GATK resource bundle.
  You will also need to specify a target sensitivity value, as described in the GATK article mentioned above.
- If you choose not to perform VQSR, the pipeline will default to hard filtering your variants. You will need to provide a GATK filter expression, as described in this GATK article. One example might be "QD < 2.0 || FS > 60.0 || MQ < 40.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0".
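
Before launching the pipeline, it can help to sanity-check the tab-delimited sample table described above. The following is a minimal sketch, not part of the pipeline itself: the function name `read_sample_table` is hypothetical, the sample names and paths are placeholders, and it assumes the file has no header row (adjust if yours does).

```python
import csv
import io


def read_sample_table(handle):
    """Parse a three-column, tab-delimited sample table into a list of tuples.

    Assumes no header row: every non-blank line is one sample.
    """
    samples = []
    seen = set()
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not field.strip() for field in row):
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, found {len(row)}")
        name, fastq_1, fastq_2 = (field.strip() for field in row)
        if name in seen:
            raise ValueError(f"line {line_no}: duplicate sample name {name!r}")
        seen.add(name)
        samples.append((name, fastq_1, fastq_2))
    return samples


# Hypothetical example content; sample names and paths are placeholders.
example = io.StringIO(
    "sampleA\tfastq/sampleA_R1.fastq.gz\tfastq/sampleA_R2.fastq.gz\n"
    "sampleB\tfastq/sampleB_R1.fastq.gz\tfastq/sampleB_R2.fastq.gz\n"
)
samples = read_sample_table(example)
print(samples)
```

To check a real file, pass an open file handle instead, for example `read_sample_table(open("samples.txt", newline=""))`, where `samples.txt` stands in for whatever path you set in the config file.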
When calling Snakemake, use the -s and --configfile options to specify the location of the Snakefile and its corresponding config file. We also recommend using the --use-conda option to let Snakemake handle all dependencies of the pipeline.
snakemake -s Snakefiles/Snakefile-variant_calling --configfile configs/config-variant_calling.yaml --use-conda
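
If you prefer to launch the pipeline from a script, the same command can be assembled in Python. This is a sketch under assumptions: `build_snakemake_command` is a hypothetical helper, not something the pipeline provides, and `extra_args` is only a pass-through for any additional Snakemake options your installation may need.

```python
import shlex


def build_snakemake_command(snakefile, configfile, use_conda=True, extra_args=()):
    """Assemble the snakemake invocation as an argument list."""
    cmd = ["snakemake", "-s", snakefile, "--configfile", configfile]
    if use_conda:
        # Let Snakemake manage the pipeline's dependencies via conda.
        cmd.append("--use-conda")
    cmd.extend(extra_args)
    return cmd


cmd = build_snakemake_command(
    "Snakefiles/Snakefile-variant_calling",
    "configs/config-variant_calling.yaml",
)
# Print the command for inspection; nothing is executed here.
print(shlex.join(cmd))

# To actually run it, something like the following should work:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if your paths contain spaces.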
The variant calling pipeline creates the following directories under the output directory specified in your config file. The genotypes folder will contain the final output, a filtered VCF containing heterozygous SNPs for all samples.
- dna_align - output from BWA and samtools
- base_recal - output from GATK's BQSR
- haplotype - output from GATK's HaplotypeCaller and a file ALL.genotype.vcf.gz containing genotyped variants for all samples
- variant_filter - output from variant filtering (either VQSR or hard filtering) of SNPs
- genotypes - heterozygotes from the filtered VCF
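
After a run, a quick way to confirm that the directories listed above exist is a small check like the one below. It is an illustrative sketch: `missing_output_dirs` is a hypothetical helper, and it assumes every listed directory is created on every run, which may not hold if, for example, you skip BQSR. The demo uses a temporary directory in place of a real output directory.

```python
import tempfile
from pathlib import Path

# Directory names taken from the list above.
EXPECTED_SUBDIRS = ["dna_align", "base_recal", "haplotype", "variant_filter", "genotypes"]


def missing_output_dirs(output_dir):
    """Return the expected subdirectories that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# Demo: simulate a partially completed run in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_SUBDIRS[:3]:
        (Path(tmp) / name).mkdir()
    demo_missing = missing_output_dirs(tmp)

print(demo_missing)
```

For a real run, call `missing_output_dirs` with the output directory set in your config file; an empty list means all five directories are present.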