Skip to content

Latest commit

 

History

History
26 lines (18 loc) · 5.46 KB

README.md

File metadata and controls

26 lines (18 loc) · 5.46 KB

Documentation

For updated and more detailed documentation, please refer to our Github Wiki page at https://github.com/pughlab/VisCap/wiki

Introduction

VisCap is a copy number detection and visualization tool written in R (www.r-project.org) for analysis of next-generation sequencing data derived from hybrid-capture experiments. It requires installation of the R libraries "gplots", "zoo", and “cluster”. We reported validation of this algorithm for detection of germline variation in Genetics in Medicine:

VisCap: inference and visualization of germ-line copy-number variants from targeted clinical sequencing data.
Pugh TJ, Amr SS, Bowser MJ, Gowrisankar S, Hynes E, Mahanta LM, Rehm HL, Funke B, Lebo MS.
Genet Med. 2015 Dec 17. doi: 10.1038/gim.2015.156. [Epub ahead of print]
PMID: 26681316
www.ncbi.nlm.nih.gov/pubmed/26681316

Input

As input, VisCap reads interval summary files generated by the Genome Analysis Toolkit (GATK)"DepthOfCoverage" tool (www.broadinstitute.org/gatk). For each sample, these files contain a summary of the total coverage of each genome region interval listed in a reference interval list. For our purposes, this interval list contains the coordinates of each target (usually an exon) captured by a specific bait set. Importantly, the bed file provided to GATK DepthOfCoverage must not contain any overlapping intervals as these are merged into single intervals resulting in a DepthOfCoverage file containing few intervals than the original bed file.

Algorithm

The initial step in the VisCap program is to generate a matrix of all intervals captured (~1,383 for OtoGenome) and the fractional of total coverage assigned to these interval within each sample from a batch (10 samples for production runs). When different interval lists are used to make the coverage files for samples within a batch, VisCap will use only the intervals common to all samples. Therefore, inclusion of even a single coverage file with a small number of targets will result in only these targets being considered across the entire batch. To normalize fractional coverage values for copy number detection, each is divided by the median for that target across the entire batch. These are stored as a matrix of log2 ratios and written to a file (log2_ratio_table.xls).

The X-chromosome requires further normalization as there are significant fractional coverage differences between males and females. Depending on the balance of males and females in the batch, males may display single-copy loss of chromosome X or females may display a gain. These patterns are evident from two clusters of boxplots. The clusters are detected informatically by removing outlier probes and then partitioning all samples around two medoids, a more robust version of K-means clustering. Each cluster is then normalized to zero by subtracting the median of each cluster from each sample. To facilitate review of this procedure, boxplots of log2 ratios are generated before and after subtraction of the cluster medians (QC_chrX_pre-scale.pdf and QC_chrX_post-scale.pdf).

Boxplots are constructed using the standard "boxplot()" command within R. This provides a visual representation of the distribution of log2 ratios from each sample (QC_cnv_boxplot.pdf) as well as a numeric summary used for subsequent thresholding and quality control (QC_cnv_boxplots.xls). The boxplot output summary includes: upper whisker, Q3, median, Q1, and lower whisker. Upper (lower) whiskers are the greatest (lowest) values that fall within Q3 (Q1) plus a quantity (default: 3) times the interquartile range (Q3 - Q1). A sample fails quality control if either of the boxplot whiskers extends beyond the expected theoretical log2 ratio for a single-copy gain (0.58) or a single-copy loss (-1). Failed samples are identified in the boxplot summary output, and VisCap automatically repeats the analysis until all samples pass quality control. The complete set of output files is generated for each iteration and stored in separate folders. For further quality control, log2 ratios across the entire batch are visualized in a single heatmap-based output (QC_cnv_heatmap.pdf).

The thresholding strategy for determining if a log2 ratio represents a gain or a loss is dependent on the boxplot whiskers and a set of fixed thresholds. These fixed thresholds were established based on the analysis of positive control samples (see validation report) and they represent the minimum log2 ratio for gains (0.40) and maximum log2 ratio for losses (-0.55) at which a call is made. A copy number variant is called when a log2 ratio exceeds both the upper (lower) fixed threshold and the upper (lower) whisker for the sample. The minimum number of consecutive exons with log2 ratios outside of these thresholds to call a copy number variant can also be set by the user (default: 1).

Output

For each sample, candidate copy number variants are output as a text data table (*.cnvs.xls). All calls from all intervals across all samples are also written to a file (cnv_boxplot_outliers.xls, losses -1, copy-neutral 0, gains 1). To facilitate visual review of the data for each sample, log2 ratios are plotted by relative genome order (i.e. rank order, not to scale on genome) and data points supporting copy number variants are color-coded (gains: red, losses: blue). This plot is overlaid with guidelines depicting the two sets of thresholds used for copy number variation (whiskers: dashed, fixed: solid black) and the theoretical log2 ratios (solid light gray) for single-copy gains and single-copy losses.