DL4VC is a deep learning-based variant caller for germline variant calling from short reads.
It proposes a deep averaging network (DAN) designed specifically for variant calling. The model takes as
input a tensor encoding the aligned reads in a proposed variant region, together with a variant proposal,
and outputs a softmax over three categories: {no variant, heterozygous variant, homozygous variant}.
This model takes into account the independence of each short input read sequence by transforming individual
reads through a series of 1D convolutional layers, limiting the communication between individual reads
to averaging and concatenating operations, before passing them into a fully connected network.
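The data flow described above can be illustrated with a minimal sketch. It is written in dependency-free Python rather than PyTorch and is not the repository's model: the fixed `EMBED` table stands in for learned base embeddings, and the kernels, weights, proposal encoding, and the choice to concatenate the proposal after pooling are all hypothetical. It only shows that, when reads interact solely through averaging, the output does not depend on read order.

```python
import math

# Stand-in for a learned base embedding (hypothetical values).
EMBED = {"A": 0.1, "C": -0.3, "G": 0.5, "T": -0.2}

def conv1d(signal, kernel):
    """Valid 1D cross-correlation over a single read."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def read_features(read, kernels):
    """Embed one read, convolve it, apply ReLU, then max-pool over positions."""
    signal = [EMBED[base] for base in read]
    return [max(max(v, 0.0) for v in conv1d(signal, kern)) for kern in kernels]

def mean_pool(per_read):
    """Average feature vectors across reads; this is the only cross-read step."""
    n = len(per_read)
    return [sum(column) / n for column in zip(*per_read)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(reads, proposal, kernels, weights, bias):
    """Per-read convolutions -> average -> concatenate proposal -> linear -> softmax."""
    pooled = mean_pool([read_features(r, kernels) for r in reads])
    features = pooled + proposal  # list concatenation
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

reads = ["ACGTAC", "ACGAAC", "ACGTAC"]
proposal = [1.0, 0.0]                      # hypothetical proposal encoding
kernels = [[0.5, -0.5, 0.25], [1.0, 0.0, -1.0]]
weights = [[0.2, -0.1, 0.3, 0.0],          # one row per output category
           [-0.4, 0.5, 0.1, 0.2],
           [0.3, 0.3, -0.2, 0.1]]
bias = [0.0, 0.1, -0.1]

# Probabilities for {no variant, heterozygous, homozygous}.
probs = classify(reads, proposal, kernels, weights, bias)
print(probs)
```

Because the reads are combined only through `mean_pool`, passing them in a different order yields the same probabilities up to floating-point rounding.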
Our purpose-built model achieves state-of-the-art results on the precisionFDA germline variant calling dataset (compared after the competition closed).
To facilitate future work, we release our code, trained models and pre-processed public domain datasets through this repo.
PrecisionFDA Truth Challenge results vs DL4VC
Variant Caller | Type | F1 | Recall | Precision |
---|---|---|---|---|
rpoplin-dv42 | Overall | 0.998597 | 0.998275 | 0.998919 |
rpoplin-dv42 | Indels | 0.989802 | 0.987883 | 0.991728 |
rpoplin-dv42 | SNPs | 0.999587 | 0.999447 | 0.999728 |
dgrover-gatk | Overall | 0.998905 | 0.999005 | 0.998804 |
dgrover-gatk | Indels | 0.994008 | 0.993455 | 0.994561 |
dgrover-gatk | SNPs | 0.999456 | 0.999631 | 0.999282 |
astatham-gatk | Overall | 0.995679 | 0.992122 | 0.999261 |
astatham-gatk | Indels | 0.993422 | 0.992401 | 0.994446 |
astatham-gatk | SNPs | 0.995934 | 0.992091 | 0.999807 |
bgallagher-sentieon | Overall | 0.998626 | 0.998910 | 0.998342 |
bgallagher-sentieon | Indels | 0.992676 | 0.992140 | 0.993213 |
bgallagher-sentieon | SNPs | 0.999296 | 0.999673 | 0.998919 |
DL4VC | Overall | 0.998924 | 0.999076 | 0.998772 |
DL4VC | Indels | 0.992949 | 0.994708 | 0.991196 |
DL4VC | SNPs | 0.999596 | 0.999566 | 0.999625 |
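As a quick sanity check when reading these results, F1 is the harmonic mean of precision and recall. The short sketch below recomputes the DL4VC overall F1 from its reported overall precision (0.998772) and recall (0.999076); the helper name `f1_score` is ours and is not part of the repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# DL4VC overall precision and recall, as reported in the results table.
dl4vc_overall_f1 = f1_score(precision=0.998772, recall=0.999076)
print(round(dl4vc_overall_f1, 6))  # 0.998924
```

Since the published precision and recall are themselves rounded to six decimals, recomputed F1 values for other rows may differ from the table in the last digit.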
- PyTorch-based model training and inference
- 1D convolutional model with learned embeddings of bases
- Variant proposal encodings
- Down-sampling of easy examples to speed up training by 5x
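The down-sampling idea in the last bullet can be sketched as follows. This is only an illustration under stated assumptions: the README does not define what counts as an "easy" example, so the `is_easy` predicate, the `alt_fraction` field, the 0.05 threshold, and the `keep_fraction` value are all hypothetical and do not describe the repository's actual criteria.

```python
import random

def downsample_easy(examples, is_easy, keep_fraction, seed=0):
    """Keep every hard example and a random fraction of the easy ones."""
    rng = random.Random(seed)  # seeded so the subset is reproducible
    return [ex for ex in examples
            if not is_easy(ex) or rng.random() < keep_fraction]

# Hypothetical training examples: label 0 = no variant, 1 = het, 2 = hom.
examples = [{"label": 0, "alt_fraction": 0.01} for _ in range(1000)]
examples += [{"label": 1, "alt_fraction": 0.45} for _ in range(50)]

def is_easy(example):
    # Assumed notion of "easy": clear non-variant sites with almost no
    # reads supporting the alternate allele.
    return example["label"] == 0 and example["alt_fraction"] < 0.05

subset = downsample_easy(examples, is_easy, keep_fraction=0.2)
print(len(examples), "->", len(subset))
```

All hard examples survive the filter, so the training set shrinks without discarding the cases that are most informative for the model.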
Section | Description |
---|---|
Installation | System and code setup instructions |
Data | Pre-processed datasets from precisionFDA to reproduce DL4VC results |
Step by step guidelines | Instructions to train and run inference with DL4VC |
The installation has been tested on bare metal as well as in conda virtual environments. We recommend conda environments because they simplify the installation of non-Python dependencies.
Core dependencies:
- BCFtools
- Tabix
- Python 3.5+ environment
- vcfeval (optional, only needed for comparing with other VCFs)
Setup
- git clone https://github.com/clara-genomics/DL4VC.git
- cd DL4VC
- pip install -r requirements.txt
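After setup, it can help to confirm that the core dependencies are reachable. The sketch below is not part of the repository; it assumes the tools are installed under their usual executable names (`bcftools`, `tabix`, and `rtg` for RTG Tools, which provides `vcfeval`) and simply looks them up on `PATH`.

```python
import shutil
import sys

REQUIRED_TOOLS = ("bcftools", "tabix")
OPTIONAL_TOOLS = ("rtg",)  # assumed: vcfeval is run as `rtg vcfeval`

def check_environment(which=shutil.which, version=sys.version_info):
    """Report whether Python is 3.5+ and whether each tool is on PATH."""
    status = {"python>=3.5": tuple(version[:2]) >= (3, 5)}
    for tool in REQUIRED_TOOLS + OPTIONAL_TOOLS:
        status[tool] = which(tool) is not None
    return status

if __name__ == "__main__":
    for name, ok in check_environment().items():
        note = " (optional)" if name in OPTIONAL_TOOLS else ""
        print("{:<12} {}{}".format(name, "found" if ok else "MISSING", note))
```

A missing `rtg` entry only matters if you plan to compare DL4VC output against other VCFs.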
Please follow the dataset instructions in the Dataset Readme to download pre-processed data and model checkpoints from our experiment. The results mentioned in the Accuracy Highlights section can be reproduced using the same datasets.
Our Step by Step Guide gives detailed instructions for running both training and inference with the DL4VC pipeline.