variant-qc

Disclaimer

This is not an official Verily product.

variant-qc

This repository contains code to perform cohort-level quality control checks on human genomic variants. The checks are expressed as queries that run in parallel on cloud infrastructure. For prior work, see "Cloud-based interactive analytics for terabytes of genomic variants data".

View output from these queries run on public data

Before running the queries yourself, you can see the results on a few public datasets:

QC overview reports on DeepVariant Platinum Genomes data
QC overview reports on Simons Genome Diversity Project data
QC overview reports on 1000 Genomes data
example ad hoc explorations of QC results

Run these queries on your own data

Load data to BigQuery

The queries in this repository assume that the VCFs were loaded to BigQuery using Variant Transforms with the MOVE_TO_CALLS merge strategy.

The MOVE_TO_CALLS merge strategy produces a core set of columns common to all tables created from VCFs. It also groups the calls for the exact same variant (identical reference_name, start_position, end_position, reference_bases, and all alternate_bases) together in a single row.
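The grouping behavior described above can be illustrated with a minimal Python sketch. This is not the Variant Transforms implementation; the function name `merge_move_to_calls` and the dictionary layout of the input records are assumptions made only for illustration. Records that share the same variant key end up in one row, with their calls collected into a single list.

```python
def merge_move_to_calls(records):
    """Group calls that share an identical variant key into one row.

    Illustrative only: each input record is assumed to be a dict with the
    five key fields plus a "call" list. This mimics the grouping idea, not
    the actual Variant Transforms code.
    """
    merged = {}
    for rec in records:
        key = (
            rec["reference_name"],
            rec["start_position"],
            rec["end_position"],
            rec["reference_bases"],
            tuple(rec["alternate_bases"]),  # all alternates must match
        )
        if key not in merged:
            merged[key] = {
                "reference_name": key[0],
                "start_position": key[1],
                "end_position": key[2],
                "reference_bases": key[3],
                "alternate_bases": list(key[4]),
                "call": [],
            }
        # Move this record's calls into the shared row.
        merged[key]["call"].extend(rec["call"])
    return list(merged.values())


# Two single-sample records for the same variant, one for a different alternate.
records = [
    {"reference_name": "chr1", "start_position": 100, "end_position": 101,
     "reference_bases": "A", "alternate_bases": ["G"],
     "call": [{"name": "sample1", "genotype": [0, 1]}]},
    {"reference_name": "chr1", "start_position": 100, "end_position": 101,
     "reference_bases": "A", "alternate_bases": ["G"],
     "call": [{"name": "sample2", "genotype": [1, 1]}]},
    {"reference_name": "chr1", "start_position": 100, "end_position": 101,
     "reference_bases": "A", "alternate_bases": ["T"],
     "call": [{"name": "sample3", "genotype": [0, 1]}]},
]
rows = merge_move_to_calls(records)
```

In this sketch the first two records collapse into one row holding two calls, while the third stays separate because its alternate_bases differ.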

We recommend loading single-sample VCFs into a "genome call table" and also the multisample VCF into a "multisample-variants table".

If you do not have a multisample VCF, you could:

use https://github.com/gatk-workflows/gatk4-germline-snps-indels#joint-discovery-gatk- to create one
use https://github.com/verilylifesciences/joint-genotype to create one
or skip the queries that require knowing how many samples match the reference, such as Hardy-Weinberg Equilibrium
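To see why a check such as Hardy-Weinberg Equilibrium needs the number of samples that match the reference, consider the following minimal sketch of the standard chi-squared comparison for a biallelic site. It is not the query used in this repository; the function name and the example counts are hypothetical. The homozygous-reference count enters both the allele frequency estimate and the observed-versus-expected comparison, so it cannot be omitted.

```python
def hardy_weinberg_chi_squared(n_hom_ref, n_het, n_hom_alt):
    """Return expected genotype counts and the chi-squared statistic.

    Standard textbook calculation for a biallelic site; illustrative only.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("at least one genotyped sample is required")
    # Reference allele frequency estimated from the observed genotypes.
    p = (2 * n_hom_ref + n_het) / (2 * n)
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_hom_ref, n_het, n_hom_alt)
    # Skip categories with zero expected count to avoid division by zero.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return expected, chi2


# Hypothetical counts: 50 homozygous reference, 40 heterozygous, 10 homozygous alternate.
expected, chi2 = hardy_weinberg_chi_squared(50, 40, 10)
```

With these hypothetical counts the reference allele frequency is 0.7, so the expected genotype counts are approximately 49, 42, and 9.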

Predict ancestry

If your sample information does not already include ancestry, you can predict the ancestry for each genome using Genomic ancestry inference with deep learning.

Run the QC overview reports

Run the RMarkdown parameterized reports to get an overview of your data.

Drill down on results

Drill down further on results by creating additional plots and/or performing additional queries. For example, the queries in this repository can be run from Jupyter notebooks, and further queries can then be used to explain the results for a particular dataset.
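As a hedged illustration of what an additional drill-down query might look like, the sketch below builds a BigQuery Standard SQL string that counts non-reference calls per sample. It is not one of the queries in this repository. The helper name, the table name, and the sample column name are assumptions; the sample identifier column inside the repeated `call` record may be named differently depending on the Variant Transforms version used to load the data, so check your table schema first.

```python
def per_sample_variant_count_sql(table, sample_column="name"):
    """Build a query counting calls with at least one non-reference allele.

    Assumes a table with a repeated `call` record containing a repeated
    integer `genotype` field; `sample_column` is an assumption to verify
    against your own schema.
    """
    return f"""
SELECT
  c.{sample_column} AS sample,
  COUNT(1) AS n_variant_calls
FROM
  `{table}` AS v,
  UNNEST(v.call) AS c
WHERE
  EXISTS (SELECT 1 FROM UNNEST(c.genotype) AS gt WHERE gt > 0)
GROUP BY
  sample
ORDER BY
  sample
""".strip()


# Hypothetical table name; replace with your own project, dataset, and table.
sql = per_sample_variant_count_sql("my-project.my_dataset.my_variants")
print(sql)
```

The resulting string can be pasted into the BigQuery console or passed to whichever BigQuery client your notebook environment provides.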

Technologies used

The methods make use of:

Standard SQL via BigQuery
Apache Beam via Dataflow
TensorFlow via Cloud Machine Learning Engine

Each technology has introductory material that may help you when working with the code in this repository.
