This is not an official Verily product.
This repository contains code to perform cohort-level quality control checks on human genomic variants. Cloud technology is used to perform queries in parallel. For prior work, see Cloud-based interactive analytics for terabytes of genomic variants data.
Before running the queries yourself, you can see the results on a few public datasets:
- QC overview reports on DeepVariant Platinum Genomes data
- QC overview reports on Simons Genome Diversity Project data
- QC overview reports on 1000 Genomes data
- example ad hoc explorations of QC results
The queries in this repository assume that the VCFs were loaded into BigQuery using Variant Transforms with the MOVE_TO_CALLS merge strategy.
Using the MOVE_TO_CALLS merge strategy will produce a core set of columns common to all tables created from VCFs, with calls for the exact same (reference_name, start_position, end_position, reference_bases, and all alternate_bases) grouped together in a single row.
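The grouping described above can be pictured with a small Python sketch. This is only an illustration of the idea, not how Variant Transforms implements it. The function name `merge_to_calls` and the input fields `sample` and `genotype` are hypothetical; only the five key fields come from the description above.

```python
from collections import defaultdict


def merge_to_calls(records):
    """Group per-sample records that share the same variant key into one row.

    Hypothetical helper for illustration only. Each input record is a dict
    with the five key fields plus 'sample' and 'genotype' (assumed names).
    """
    grouped = defaultdict(list)
    for rec in records:
        key = (
            rec["reference_name"],
            rec["start_position"],
            rec["end_position"],
            rec["reference_bases"],
            tuple(rec["alternate_bases"]),  # all alternates must match exactly
        )
        grouped[key].append({"name": rec["sample"], "genotype": rec["genotype"]})

    rows = []
    for key, calls in grouped.items():
        rows.append({
            "reference_name": key[0],
            "start_position": key[1],
            "end_position": key[2],
            "reference_bases": key[3],
            "alternate_bases": list(key[4]),
            "calls": calls,  # every sample's call for this exact variant
        })
    return rows


# Two samples share one variant; a third record is a different variant.
example_records = [
    {"reference_name": "chr1", "start_position": 100, "end_position": 101,
     "reference_bases": "A", "alternate_bases": ["G"],
     "sample": "sample_a", "genotype": [0, 1]},
    {"reference_name": "chr1", "start_position": 100, "end_position": 101,
     "reference_bases": "A", "alternate_bases": ["G"],
     "sample": "sample_b", "genotype": [1, 1]},
    {"reference_name": "chr1", "start_position": 100, "end_position": 101,
     "reference_bases": "A", "alternate_bases": ["T"],
     "sample": "sample_c", "genotype": [0, 1]},
]
example_rows = merge_to_calls(example_records)
```

Note that the third record stays in its own row because its alternate bases differ, even though its position and reference bases match the other two.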
We recommend loading single-sample VCFs into a "genome call table" and also the multisample VCF into a "multisample-variants table".
If you do not have a multisample VCF, you could:
- use https://github.com/gatk-workflows/gatk4-germline-snps-indels#joint-discovery-gatk- to create one
- use https://github.com/verilylifesciences/joint-genotype to create one
- or skip the queries that require knowing how many samples match the reference, such as Hardy-Weinberg Equilibrium
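To see why a check such as Hardy-Weinberg Equilibrium needs the number of samples matching the reference, consider the following minimal sketch of the textbook chi-squared test for a biallelic site. It is not the query used in this repository; the function name `hwe_chi_squared` and its parameters are hypothetical.

```python
def hwe_chi_squared(n_hom_ref, n_het, n_hom_alt):
    """Chi-squared statistic for Hardy-Weinberg equilibrium at a biallelic site.

    Hypothetical helper for illustration. n_hom_ref is the count of samples
    matching the reference on both alleles, which is the number a
    multisample VCF makes available.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("at least one genotyped sample is required")

    # Allele frequencies estimated from the observed genotype counts.
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p                            # alternate allele frequency

    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (p * p * n, 2 * p * q * n, q * q * n)

    # Skip genotype classes with zero expected count to avoid dividing by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Counts that match the expected 1:2:1 proportions at p = 0.5.
balanced = hwe_chi_squared(25, 50, 25)
# No heterozygotes at all: a strong departure from equilibrium.
no_hets = hwe_chi_squared(50, 0, 50)
```

Without the homozygous-reference count, neither the allele frequencies nor the expected genotype counts can be computed, which is why these queries depend on the multisample-variants table.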
If your sample information does not already include ancestry, you can predict the ancestry for each genome using Genomic ancestry inference with deep learning.
Run the RMarkdown parameterized reports to get an overview of your data.
Drill down further on results by creating additional plots or performing additional queries. For example, these queries can be run from Jupyter notebooks, and further queries can then be used to explain the results for a particular dataset.
The methods make use of:
- Standard SQL via BigQuery
- Apache Beam via Dataflow
- TensorFlow via Cloud Machine Learning Engine
Each technology has introductory material that may help you when working with the code in this repository.