Kids First Data Resource Center Germline Copy Number Variant Workflow

The Kids First Data Resource Center (KFDRC) Germline Copy Number Variant (CNV) Workflow is a Common Workflow Language (CWL) implementation that generates CNV calls from a BAM or CRAM file of aligned reads. The workflow uses CNVnator and GATK to call variants, which are then annotated using AnnotSV.

Relevant Software and Versions

CNVnator

CNVnator is a read-depth (RD) based method for CNV discovery and genotyping. The method is based on combining the established mean-shift approach with additional refinements (multiple-bandwidth partitioning and GC correction) to broaden the range of discovered CNVs. Overall, for CNVs accessible by RD, CNVnator has high sensitivity (86%-96%), low false-discovery rate (3%-20%), high genotyping accuracy (93%-95%), and high resolution in breakpoint discovery (<200 bp in 90% of cases with high sequencing coverage). Furthermore, CNVnator is complementary in a straightforward way to split-read and read-pair approaches: It misses CNVs created by retrotransposable elements, but more than half of the validated CNVs that it identifies are not detected by split-read or read-pair.
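The GC correction mentioned above can be illustrated with a minimal sketch. It assumes one common form of the correction, in which each bin's read depth is rescaled by the ratio of the global mean depth to the mean depth of bins with the same GC content. The function name, the bin values, and the integer GC strata are hypothetical and chosen for illustration; this is not CNVnator's own code.

```python
from collections import defaultdict
from statistics import mean


def gc_correct(read_depth, gc_percent):
    """Rescale each bin's depth by (global mean depth / mean depth of its GC stratum)."""
    if len(read_depth) != len(gc_percent):
        raise ValueError("read_depth and gc_percent must have the same length")

    global_mean = mean(read_depth)

    # Group raw depths by the GC percentage of their bin.
    by_gc = defaultdict(list)
    for rd, gc in zip(read_depth, gc_percent):
        by_gc[gc].append(rd)
    gc_mean = {gc: mean(values) for gc, values in by_gc.items()}

    corrected = []
    for rd, gc in zip(read_depth, gc_percent):
        # Bins whose GC stratum has zero mean depth are left unchanged.
        scale = global_mean / gc_mean[gc] if gc_mean[gc] > 0 else 1.0
        corrected.append(rd * scale)
    return corrected


# Hypothetical bins: GC-rich bins (60%) are systematically under-covered.
raw_depth = [100, 110, 50, 60, 100, 55]
gc_content = [40, 40, 60, 60, 40, 60]
print(gc_correct(raw_depth, gc_content))
```

After correction, the mean depth within each GC stratum equals the global mean, so a depth shift driven purely by GC content no longer looks like a copy-number change.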

Read more about the software in their paper: https://pubmed.ncbi.nlm.nih.gov/21324876/

GATK gCNV

GATK gCNV is a methodology and suite of software for discovering rare and common CNVs from next-generation sequencing (NGS) read-depth (RD) data. In GATK gCNV, sequencing biases are modeled via negative-binomial factor analysis, and copy-number states and genomic regions of high and low CNV activity are modeled using a hierarchical hidden Markov model (HHMM). Automatic differentiation variational inference (ADVI) and variational message passing are used to infer continuous and discrete latent variables in a principled framework. A deterministic annealing protocol is used to deal with the non-convexity of the variational objective function.

Read more about the software in their paper: https://github.com/broadinstitute/gatk/blob/master/docs/CNV/germline-cnv-caller-model.pdf

AnnotSV

AnnotSV is a program designed for annotating and ranking structural variations (SVs). The tool compiles functional, regulatory, and clinically relevant information and aims to provide annotations useful for i) interpreting the potential pathogenicity of SVs and ii) filtering out potential false-positive SVs.

Input Files

  • Universal
    • aligned_reads: The germline BAM/CRAM input that has been aligned to a reference genome.
    • indexed_reference_fasta: The reference genome FASTA (and associated indices) to which the germline BAM/CRAM was aligned.
    • intervals/blacklist_intervals: Intervals to include in or exclude from the analysis.
  • GATK
    • contig_ploidy_model_tar: The contig-ploidy model directory generated by the DetermineGermlineContigPloidyCohortMode task in the Cohort workflow.
    • gcnv_model_tars: Array of tars of the gCNV model directories generated by the GermlineCNVCallerCohortMode tasks in the Cohort workflow.
  • AnnotSV
    • annotsv_annotations_dir: The annotations produced by the install-human-annotation step of the AnnotSV installation (see: https://github.com/lgmgeo/AnnotSV/#quick-installation). Specifically, these are the annotations installed with v3.1.1 of the software. Newer or older annotations can be substituted as needed.
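As a rough illustration of how the inputs listed above fit together, the sketch below builds a CWL job object in Python and serializes it to JSON, which CWL runners accept as an inputs file. The input names come from the list above; every path is hypothetical, the optional interval inputs are omitted, and passing annotsv_annotations_dir as a File is an assumption (if the workflow declares it as a Directory, use "class": "Directory" instead). Index files are assumed to sit next to the files they index.

```python
import json


def cwl_file(path):
    """Return a CWL File object for a local path."""
    return {"class": "File", "path": path}


def build_inputs():
    """Assemble a job object; all paths below are hypothetical placeholders."""
    return {
        "aligned_reads": cwl_file("sample.cram"),
        "indexed_reference_fasta": cwl_file("reference.fasta"),
        "contig_ploidy_model_tar": cwl_file("contig_ploidy_model.tar.gz"),
        # One tar per GermlineCNVCaller cohort-mode shard.
        "gcnv_model_tars": [
            cwl_file("gcnv_model_shard_0.tar.gz"),
            cwl_file("gcnv_model_shard_1.tar.gz"),
        ],
        # Assumption: the annotations are supplied as a single archive file.
        "annotsv_annotations_dir": cwl_file("annotsv_annotations.tar.gz"),
    }


inputs = build_inputs()
print(json.dumps(inputs, indent=2))
```

The printed JSON can be saved to a file and supplied to a CWL runner alongside the workflow document.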

Output Files

  • CNVnator
    • cnvnator_vcf: Called CNVs in VCF format
    • cnvnator_called_cnvs: Called CNVs from aligned_reads
    • cnvnator_average_rd: Average RD stats
  • GATK
    • gatk_gcnv_genotyped_intervals_vcfs: Per-sample VCF files that provide a detailed listing of the most likely copy-number call for each genomic interval included in the call-set, along with call quality, call genotype, and the phred-scaled posterior probability vector for all integer copy-number states.
    • gatk_gcnv_genotyped_segments_vcfs: Per-sample VCF files containing coalesced contiguous intervals that share the same copy-number call.
    • gatk_gcnv_denoised_copy_ratios: Per-sample files that concatenate the posterior means for denoised copy ratios from all the call shards produced by GermlineCNVCaller.
  • AnnotSV
    • cnvnator_annotated_cnvs: This file contains all records from the cnvnator_vcf that AnnotSV could annotate.
    • gatk_gcnv_annotated_genotyped_segments: Per-sample TSV files containing AnnotSV-annotated CNVs from gatk_gcnv_genotyped_segments_vcfs.
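The phred-scaled posterior vector reported in the genotyped intervals VCFs can be turned back into probabilities with the standard Phred relation Q = -10 log10(p). The sketch below is a minimal illustration of that conversion; the function name and the example vector are hypothetical, and it assumes the vector is indexed by integer copy-number state starting at 0. The result is renormalized because integer-rounded scores need not sum exactly to one.

```python
def phred_to_posteriors(phred_scores):
    """Convert Phred-scaled values (Q = -10 * log10(p)) to normalized probabilities."""
    unnormalized = [10 ** (-q / 10) for q in phred_scores]
    total = sum(unnormalized)
    return [p / total for p in unnormalized]


# Hypothetical vector for copy-number states 0..5 (list index = copy number).
scores = [60, 30, 0, 25, 50, 70]
posteriors = phred_to_posteriors(scores)

# The most likely state is the one with the lowest Phred score.
best_cn = max(range(len(posteriors)), key=posteriors.__getitem__)
print(f"most likely copy number: {best_cn}")
print(f"posterior probability: {posteriors[best_cn]:.4f}")
```

A lower score means a more probable state, so a score of 0 marks the called copy number and the remaining scores indicate how strongly the alternatives are disfavored.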

Basic Info

References