Kids First Data Resource Center Germline Copy Number Variant Workflow

The Kids First Data Resource Center (KFDRC) Germline Copy Number Variant (CNV) Workflow is a Common Workflow Language (CWL) implementation that generates CNV calls from a BAM or CRAM file of aligned reads. The workflow uses CNVnator and GATK to call variants, which are then annotated using AnnotSV.

Relevant Software and Versions

CNVnator: 0.4.1
GATK: 4.2.0.0
AnnotSV: 3.1.1

CNVnator

CNVnator is a read-depth (RD) based method for CNV discovery and genotyping. The method is based on combining the established mean-shift approach with additional refinements (multiple-bandwidth partitioning and GC correction) to broaden the range of discovered CNVs. Overall, for CNVs accessible by RD, CNVnator has high sensitivity (86%-96%), low false-discovery rate (3%-20%), high genotyping accuracy (93%-95%), and high resolution in breakpoint discovery (<200 bp in 90% of cases with high sequencing coverage). Furthermore, CNVnator is complementary in a straightforward way to split-read and read-pair approaches: It misses CNVs created by retrotransposable elements, but more than half of the validated CNVs that it identifies are not detected by split-read or read-pair.
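The GC correction mentioned above can be illustrated with a short sketch. It assumes the per-bin rescaling described in the CNVnator paper, where each bin's read depth is multiplied by the ratio of the global mean read depth to the mean read depth of bins with the same GC content. The function name, the grouping of bins by rounded GC percent, and the toy input are illustrative assumptions, not CNVnator's actual code.

```python
from collections import defaultdict
from statistics import mean


def gc_correct(read_depths, gc_fractions):
    """Rescale per-bin read depth so bins with different GC content are comparable.

    Hypothetical helper, not part of CNVnator. Bins are grouped by GC percent
    (rounded to an integer) and each bin is multiplied by
    global_mean / mean_of_its_gc_group.
    """
    if len(read_depths) != len(gc_fractions):
        raise ValueError("read_depths and gc_fractions must have the same length")
    if not read_depths:
        return []

    global_mean = mean(read_depths)

    # Collect the read depths observed in each GC stratum.
    by_gc = defaultdict(list)
    for rd, gc in zip(read_depths, gc_fractions):
        by_gc[round(gc * 100)].append(rd)
    gc_means = {gc: mean(values) for gc, values in by_gc.items()}

    corrected = []
    for rd, gc in zip(read_depths, gc_fractions):
        group_mean = gc_means[round(gc * 100)]
        # A stratum with zero mean depth carries no signal to rescale.
        corrected.append(rd * global_mean / group_mean if group_mean > 0 else 0.0)
    return corrected


# Toy data (made up): two bins at 40% GC and two under-covered bins at 60% GC.
raw_depths = [100, 110, 50, 60]
gc_content = [0.40, 0.40, 0.60, 0.60]
corrected_depths = gc_correct(raw_depths, gc_content)
print(corrected_depths)
```

After correction, both GC strata in the toy data have the same mean depth as the global mean, so the remaining differences between bins are no longer driven by GC content.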

Read more about the software in their paper: https://pubmed.ncbi.nlm.nih.gov/21324876/

GATK gCNV

GATK gCNV is a methodology and suite of software for discovering rare and common CNVs from next-generation sequencing (NGS) read-depth (RD) data. In GATK gCNV, sequencing biases are modeled via negative-binomial factor analysis, and copy-number states and genomic regions of high and low CNV activity are modeled using a hierarchical hidden Markov model (HHMM). Automatic differentiation variational inference (ADVI) and variational message passing are used to infer continuous and discrete latent variables in a principled framework. A deterministic annealing protocol is used to deal with the non-convexity of the variational objective function.

Read more about the software in their paper: https://github.com/broadinstitute/gatk/blob/master/docs/CNV/germline-cnv-caller-model.pdf

AnnotSV

AnnotSV is a program designed for annotating and ranking structural variations (SVs). It compiles functional, regulatory, and clinically relevant information and aims to provide annotations useful for i) interpreting the potential pathogenicity of SVs and ii) filtering out potential false-positive SVs.

Input Files

Universal
- aligned_reads: The germline BAM/CRAM input that has been aligned to a reference genome.
- indexed_reference_fasta: The reference genome FASTA (and associated indices) to which the germline BAM/CRAM was aligned.
- intervals/blacklist_intervals: Intervals to include in or exclude from the analysis.
GATK
- contig_ploidy_model_tar: The contig-ploidy model directory generated by the DetermineGermlineContigPloidyCohortMode task in the Cohort workflow.
- gcnv_model_tars: Array of tars of the gCNV model directories generated by the GermlineCNVCallerCohortMode tasks in the Cohort workflow.
AnnotSV
- annotsv_annotations_dir: These annotations are simply those from the install-human-annotation installation process run during AnnotSV installation (see: https://github.com/lgmgeo/AnnotSV/#quick-installation). Specifically these are the annotations installed with v3.1.1 of the software. Newer or older annotations can be slotted in here as needed.
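As a rough illustration of how the inputs above fit together, the sketch below builds a CWL job file as JSON. The input names come from this README, but every path is a placeholder, and the CWL types (File versus Directory, single value versus array) are assumptions inferred from the descriptions rather than read from the workflow itself. The optional intervals and blacklist_intervals inputs are left out.

```python
import json


def cwl_file(path):
    """Return a CWL File object as it appears in a job file."""
    return {"class": "File", "path": path}


def build_inputs(aligned_reads, reference_fasta, ploidy_model_tar,
                 gcnv_model_tars, annotsv_dir):
    """Assemble a job dictionary keyed by the input names listed in this README.

    Index files (for example .crai or .fai) are assumed to sit next to the
    files they belong to.
    """
    return {
        "aligned_reads": cwl_file(aligned_reads),
        "indexed_reference_fasta": cwl_file(reference_fasta),
        "contig_ploidy_model_tar": cwl_file(ploidy_model_tar),
        "gcnv_model_tars": [cwl_file(path) for path in gcnv_model_tars],
        # Assumption: the annotations are passed as a directory; the workflow
        # may expect a different type.
        "annotsv_annotations_dir": {"class": "Directory", "path": annotsv_dir},
    }


# Placeholder paths; replace with real files before running anything.
inputs = build_inputs(
    aligned_reads="sample.cram",
    reference_fasta="reference.fasta",
    ploidy_model_tar="contig_ploidy_model.tar.gz",
    gcnv_model_tars=["gcnv_model_shard_0.tar.gz", "gcnv_model_shard_1.tar.gz"],
    annotsv_dir="Annotations_Human",
)
print(json.dumps(inputs, indent=2))
```

Saved to a file such as `inputs.json`, the output could be passed to cwltool together with the workflow, for example `cwltool <workflow>.cwl inputs.json`, where `<workflow>.cwl` stands for the workflow file in this repository.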

Output Files

CNVnator
- cnvnator_vcf: Called CNVs in VCF format
- cnvnator_called_cnvs: Called CNVs from aligned_reads
- cnvnator_average_rd: Average RD stats
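For downstream scripting, the called CNVs can be loaded into simple records. The sketch below assumes that cnvnator_called_cnvs is the tab-delimited output of CNVnator's calling step, with the columns CNV_type, coordinates, CNV_size, normalized_RD, e-val1 to e-val4, and q0. The class name, function name, and example line are made up for illustration.

```python
from typing import NamedTuple


class CnvCall(NamedTuple):
    cnv_type: str         # "deletion" or "duplication"
    chrom: str
    start: int
    end: int
    size: int
    normalized_rd: float  # read depth normalized to 1
    q0: float


def parse_call_line(line):
    """Parse one line of the assumed CNVnator call format into a CnvCall."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 tab-separated fields, got {len(fields)}")
    cnv_type, coordinates, size, normalized_rd = fields[:4]
    q0 = fields[8]
    # Split on the last colon so contig names that contain ':' still parse.
    chrom, span = coordinates.rsplit(":", 1)
    start, end = span.split("-")
    return CnvCall(cnv_type, chrom, int(start), int(end),
                   int(size), float(normalized_rd), float(q0))


# Made-up example line in the assumed format.
example = "deletion\tchr1:10001-20000\t10000\t0.45\t1.2e-08\t3.4e-10\t2.0e-08\t5.1e-10\t0.02"
call = parse_call_line(example)
print(call)
```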
GATK
- gatk_gcnv_genotyped_intervals_vcfs: Per-sample VCF files that provide a detailed listing of the most likely copy-number call for each genomic interval included in the call-set, along with call quality, call genotype, and the phred-scaled posterior probability vector for all integer copy-number states.
- gatk_gcnv_genotyped_segments_vcfs: Per-sample VCF files containing coalesced contiguous intervals that share the same copy-number call.
- gatk_gcnv_denoised_copy_ratios: Per-sample files that concatenate posterior means for denoised copy ratios from all the call shards produced by the GermlineCNVCaller.
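The phred-scaled posterior probability vector reported in gatk_gcnv_genotyped_intervals_vcfs can be converted back to ordinary probabilities. The sketch below is a minimal illustration that assumes the vector has already been extracted from a VCF record (in gCNV output this is believed to be the CNLP FORMAT field) and that entry i corresponds to copy number i. Because phred values are typically rounded, the result is renormalized so that it sums to 1.

```python
def phred_to_probabilities(phred_scores):
    """Convert phred-scaled values (-10 * log10 p) into normalized probabilities."""
    if not phred_scores:
        raise ValueError("phred_scores must not be empty")
    weights = [10 ** (-score / 10) for score in phred_scores]
    total = sum(weights)
    return [weight / total for weight in weights]


# Made-up vector for copy numbers 0..3; lower phred means more probable.
phred_vector = [30, 0, 25, 40]
probabilities = phred_to_probabilities(phred_vector)
most_likely_copy_number = max(range(len(probabilities)), key=probabilities.__getitem__)
print(most_likely_copy_number, probabilities)
```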
AnnotSV
- cnvnator_annotated_cnvs: This file contains all records from the cnvnator_vcf that AnnotSV could annotate.
- gatk_gcnv_annotated_genotyped_segments: Per sample TSV files containing AnnotSV-annotated CNVs from gatk_gcnv_genotyped_segments_vcfs
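The annotated TSV outputs can be filtered with standard tools. The sketch below keeps one row per CNV (the "full" annotation lines) whose ACMG class is at or above a threshold. It assumes the column names `Annotation_mode` and `ACMG_class` as found in AnnotSV 3.x output, with classes 4 and 5 denoting likely pathogenic and pathogenic. The function name and the toy table are hypothetical, and the column names should be checked against the real files.

```python
import csv
import io


def select_full_annotations(tsv_text, min_acmg_class=4):
    """Return 'full' annotation rows with an integer ACMG class >= min_acmg_class."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected = []
    for row in reader:
        if row.get("Annotation_mode") != "full":
            continue
        try:
            acmg_class = int(row.get("ACMG_class") or "")
        except ValueError:
            # Skip rows without a numeric class (for example "NA").
            continue
        if acmg_class >= min_acmg_class:
            selected.append(row)
    return selected


# Toy table (made up) using a small subset of the assumed columns.
toy_tsv = (
    "AnnotSV_ID\tSV_chrom\tSV_start\tSV_end\tSV_type\tAnnotation_mode\tACMG_class\n"
    "1_10001_20000_DEL_1\t1\t10001\t20000\tDEL\tfull\t5\n"
    "1_10001_20000_DEL_1\t1\t10001\t20000\tDEL\tsplit\tNA\n"
    "2_500_900_DUP_1\t2\t500\t900\tDUP\tfull\t2\n"
    "3_100_400_DEL_1\t3\t100\t400\tDEL\tfull\tNA\n"
)
selected_rows = select_full_annotations(toy_tsv)
for row in selected_rows:
    print(row["AnnotSV_ID"], row["ACMG_class"])
```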

Basic Info

D3b dockerfiles
Testing Tools:
- Seven Bridges CAVATICA Platform
- Common Workflow Language reference implementation (cwltool)

References

KFDRC AWS S3 bucket: s3://kids-first-seq-data/broad-references/, s3://kids-first-seq-data/pipeline-references/
CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
Broad Institute Google Cloud: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0
