The Kids First Data Resource Center (KFDRC) Germline Copy Number Variant (CNV) Workflow is a Common Workflow Language (CWL) implementation that generates CNV calls from a BAM or CRAM file of aligned reads. The workflow uses CNVnator and GATK to call variants, which are then annotated using AnnotSV.
CNVnator is a read-depth (RD) based method for CNV discovery and genotyping. The method is based on combining the established mean-shift approach with additional refinements (multiple-bandwidth partitioning and GC correction) to broaden the range of discovered CNVs. Overall, for CNVs accessible by RD, CNVnator has high sensitivity (86%-96%), low false-discovery rate (3%-20%), high genotyping accuracy (93%-95%), and high resolution in breakpoint discovery (<200 bp in 90% of cases with high sequencing coverage). Furthermore, CNVnator is complementary in a straightforward way to split-read and read-pair approaches: It misses CNVs created by retrotransposable elements, but more than half of the validated CNVs that it identifies are not detected by split-read or read-pair.
Read more about the software in their paper: https://pubmed.ncbi.nlm.nih.gov/21324876/
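The GC correction mentioned above can be illustrated with a short sketch. It follows the correction described in the CNVnator paper, in which each bin's read depth is scaled by the ratio of the global mean read depth to the mean read depth of bins with the same GC content. The function name, the per-bin input lists, and the use of integer GC percentages as grouping keys are assumptions made for illustration; this is not CNVnator's code.

```python
from collections import defaultdict
from statistics import mean


def gc_correct(read_depth, gc_percent):
    """Scale each bin's read depth by (global mean RD / mean RD of same-GC bins).

    read_depth: per-bin read-depth values.
    gc_percent: per-bin GC content, used here as a grouping key (illustrative).
    """
    if len(read_depth) != len(gc_percent):
        raise ValueError("read_depth and gc_percent must have the same length")
    if not read_depth:
        return []

    global_mean = mean(read_depth)

    # Collect the read depths of all bins that share a GC value.
    by_gc = defaultdict(list)
    for rd, gc in zip(read_depth, gc_percent):
        by_gc[gc].append(rd)
    gc_mean = {gc: mean(values) for gc, values in by_gc.items()}

    corrected = []
    for rd, gc in zip(read_depth, gc_percent):
        # A GC stratum with zero mean depth carries no signal to rescale.
        if gc_mean[gc] == 0:
            corrected.append(0.0)
        else:
            corrected.append(rd * global_mean / gc_mean[gc])
    return corrected


# Toy data: depth rises with GC content, but there is no real CNV.
flat = gc_correct([50, 50, 100, 100, 150, 150], [30, 30, 50, 50, 70, 70])
print(flat)
```

In the toy data the apparent depth differences are explained entirely by GC content, so every corrected bin ends up at the global mean.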
GATK gCNV is a methodology and suite of software for discovering rare and common CNVs from next-generation sequencing (NGS) read-depth (RD) data. In GATK gCNV, sequencing biases are modeled via negative-binomial factor analysis, and copy-number states and genomic regions of high and low CNV activity are modeled using a hierarchical hidden Markov model (HHMM). Automatic differentiation variational inference (ADVI) and variational message passing are used to infer continuous and discrete latent variables in a principled framework. A deterministic annealing protocol is used to deal with the non-convexity of the variational objective function.
Read more about the software in their paper: https://github.com/broadinstitute/gatk/blob/master/docs/CNV/germline-cnv-caller-model.pdf
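To give a feel for the two ingredients named above (negative-binomial read counts and a hidden Markov model over copy-number states), here is a deliberately simplified sketch. It is not the gCNV model: it uses a flat HMM with fixed, hand-picked parameters and Viterbi decoding rather than a hierarchical HMM fitted by variational inference, and it ignores bias factors entirely. All function names, parameters, and default values are assumptions chosen for illustration.

```python
import math


def nb_log_pmf(k, mean, size):
    """Log-probability of count k under a negative binomial with the given
    mean and size (inverse dispersion); variance is mean + mean**2 / size."""
    return (
        math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )


def viterbi_copy_numbers(counts, depth_per_copy, max_cn=4, size=50.0, p_change=1e-4):
    """Most likely copy-number path for per-bin read counts (toy flat HMM)."""
    if not counts:
        return []
    states = list(range(max_cn + 1))
    log_stay = math.log(1.0 - p_change)
    log_move = math.log(p_change / (len(states) - 1))

    def emission(k, cn):
        # A small floor keeps the expected count positive for copy number 0.
        return nb_log_pmf(k, depth_per_copy * max(cn, 0.01), size)

    def transition(prev, cur):
        return log_stay if prev == cur else log_move

    # Uniform prior over the starting copy number.
    score = [-math.log(len(states)) + emission(counts[0], cn) for cn in states]
    back = []
    for k in counts[1:]:
        new_score, pointers = [], []
        for cn in states:
            best_prev = max(states, key=lambda p: score[p] + transition(p, cn))
            new_score.append(score[best_prev] + transition(best_prev, cn) + emission(k, cn))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy counts: about 50 reads per copy, with a ten-bin heterozygous deletion.
counts = [100] * 5 + [50] * 10 + [100] * 5
print(viterbi_copy_numbers(counts, depth_per_copy=50))
```

The low `p_change` makes the decoder reluctant to switch states, so a change in copy number is called only when several consecutive bins support it.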
AnnotSV is a program designed for annotating and ranking Structural Variations (SV). This tool compiles functionally, regulatory and clinically relevant information and aims at providing annotations useful to i) interpret SV potential pathogenicity and ii) filter out SV potential false positives.
- Universal
aligned_reads
: The germline BAM/CRAM input that has been aligned to a reference genome.
indexed_reference_fasta
: The reference genome FASTA (and associated indices) to which the germline BAM/CRAM was aligned.
intervals/blacklist_intervals
: Intervals to include in or exclude from analysis.
- GATK
contig_ploidy_model_tar
: The contig-ploidy model directory generated by the DetermineGermlineContigPloidyCohortMode task in the Cohort workflow.
gcnv_model_tars
: Array of tars of the gCNV model directories generated by the GermlineCNVCallerCohortMode tasks in the Cohort workflow.
- AnnotSV
annotsv_annotations_dir
: The annotations installed by the install-human-annotation step of the AnnotSV installation (see: https://github.com/lgmgeo/AnnotSV/#quick-installation). Specifically, these are the annotations installed with v3.1.1 of the software. Newer or older annotations can be slotted in here as needed.
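As a rough illustration of how the inputs listed above might be supplied, the sketch below builds a CWL job file in JSON using the input names from this list. The CWL `File` object shape (`class` plus `path`) is standard, but treating every input as a `File` is an assumption (the annotations input in particular may be a directory or an archive), and all file paths are hypothetical placeholders. Check the workflow's CWL for the actual types and any required secondary files.

```python
import json


def cwl_file(path):
    """Return a CWL File object for use in a job file."""
    return {"class": "File", "path": path}


def build_job(aligned_reads, reference_fasta, contig_ploidy_model_tar,
              gcnv_model_tars, annotsv_annotations, intervals=None):
    """Assemble a job dictionary keyed by the workflow's input names.

    Assumption: every input is passed as a CWL File; verify against the CWL.
    """
    job = {
        "aligned_reads": cwl_file(aligned_reads),
        "indexed_reference_fasta": cwl_file(reference_fasta),
        "contig_ploidy_model_tar": cwl_file(contig_ploidy_model_tar),
        "gcnv_model_tars": [cwl_file(p) for p in gcnv_model_tars],
        "annotsv_annotations_dir": cwl_file(annotsv_annotations),
    }
    if intervals is not None:
        job["intervals"] = cwl_file(intervals)
    return job


# Hypothetical paths, for illustration only.
job = build_job(
    aligned_reads="sample.cram",
    reference_fasta="reference.fasta",
    contig_ploidy_model_tar="contig_ploidy_model.tar",
    gcnv_model_tars=["gcnv_model_shard_0.tar", "gcnv_model_shard_1.tar"],
    annotsv_annotations="annotsv_annotations.tar.gz",
)
print(json.dumps(job, indent=2))
```

The printed JSON can be saved to a file and passed to a CWL runner alongside the workflow document.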
- CNVnator
cnvnator_vcf
: Called CNVs in VCF format.
cnvnator_called_cnvs
: Called CNVs from aligned_reads.
cnvnator_average_rd
: Average RD stats
- GATK
gatk_gcnv_genotyped_intervals_vcfs
: Per-sample VCF files that provide a detailed listing of the most likely copy-number call for each genomic interval included in the call set, along with call quality, call genotype, and the Phred-scaled posterior probability vector for all integer copy-number states.
gatk_gcnv_genotyped_segments_vcfs
: Per-sample VCF files containing coalesced contiguous intervals that share the same copy-number call.
gatk_gcnv_denoised_copy_ratios
: Per-sample files that concatenate posterior means for denoised copy ratios from all the call shards produced by GermlineCNVCaller.
- AnnotSV
cnvnator_annotated_cnvs
: This file contains all records from the cnvnator_vcf that AnnotSV could annotate.
gatk_gcnv_annotated_genotyped_segments
: Per-sample TSV files containing AnnotSV-annotated CNVs from gatk_gcnv_genotyped_segments_vcfs.
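The Phred-scaled posterior probability vector described for the genotyped intervals VCFs can be turned back into ordinary probabilities. The sketch below assumes the usual Phred convention, in which a value q corresponds to a relative probability of 10^(-q/10) and position i in the vector corresponds to copy number i. The function names and the example vector are made up for illustration; consult the VCF header for the actual field definition.

```python
def phred_to_probabilities(phred_scaled):
    """Convert Phred-scaled values to probabilities that sum to 1."""
    weights = [10 ** (-q / 10) for q in phred_scaled]
    total = sum(weights)
    return [w / total for w in weights]


def most_likely_copy_number(phred_scaled):
    """Index (copy number) with the highest posterior probability."""
    probs = phred_to_probabilities(phred_scaled)
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up vector for copy numbers 0, 1, 2 and 3.
example = [30, 10, 0, 20]
print(phred_to_probabilities(example))
print(most_likely_copy_number(example))
```

Renormalizing after the conversion matters because Phred-scaled values in a VCF are typically rounded integers, so the raw weights rarely sum to exactly one.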
- D3b dockerfiles
- Testing Tools:
- KFDRC AWS S3 bucket: s3://kids-first-seq-data/broad-references/, s3://kids-first-seq-data/pipeline-references/
- CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
- Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0