Add DeepRVAT annotation pipeline #28

Merged
merged 109 commits into from
Oct 20, 2023
Changes from 95 commits
efd46a8
added support for deepRiPe and deepSea, ability to concat and merge a…
Aug 4, 2023
1545c49
changed config file
Aug 4, 2023
44702f9
Create deeprvat_annotations.yml
endast Aug 4, 2023
a5fed75
Update deeprvat_annotations.yml
endast Aug 4, 2023
bb66bce
cleaned up commented out code in annotations.py
Aug 4, 2023
f0e1b86
Merge branch 'Annotations' of github.com:PMBio/deeprvat into Annotations
Aug 4, 2023
98aad4e
Update deeprvat_annotations.yml
endast Aug 7, 2023
a012f2c
Update deeprvat_annotations.yml
endast Aug 7, 2023
b58f654
Update deeprvat_annotations.yml
endast Aug 7, 2023
d6ef2f5
Use python=3.9.16
endast Aug 7, 2023
74dd04e
Update deeprvat_annotations.yml
endast Aug 7, 2023
cfe68ca
increased memory resource for rule concat_deepSea
Aug 7, 2023
deec6b7
Merge branch 'main' into Annotations
Aug 7, 2023
6e72396
Merge branch 'Annotations' of github.com:PMBio/deeprvat into Annotations
Aug 7, 2023
5091be0
Update deeprvat_annotations.yml
endast Aug 8, 2023
d8806e2
deleted obsolete environment yaml file; removed absolute path from an…
Aug 8, 2023
b327392
Merge branch 'Annotations' of github.com:PMBio/deeprvat into Annotations
Aug 8, 2023
03218c8
updated environment, config, removed dask dependency
Aug 14, 2023
3aac39a
changed annotations.py to make concatenation
Aug 23, 2023
c987400
added support for different deepRIPE models (hg2, k5 and parclip)
Aug 24, 2023
9a8391c
fixup! Format Python code with psf/black pull_request
Aug 24, 2023
44acb9e
changed some temporary paths
Aug 24, 2023
33a5b78
filtering and aggregating vep columns
Aug 25, 2023
e18412c
fixup! Format Python code with psf/black pull_request
Aug 25, 2023
5d2c32c
Update README.md
Marcel-Mueck Aug 25, 2023
5f9a2bc
Update README.md
Marcel-Mueck Aug 25, 2023
16206dc
Update README.md
Marcel-Mueck Aug 25, 2023
f707b9e
Update README.md
Marcel-Mueck Aug 25, 2023
5d65cdf
speedups using joblib and batching
bfclarke Sep 12, 2023
1fec6f2
remove unused function
bfclarke Sep 13, 2023
530da1f
n_jobs as parameter
Sep 14, 2023
a85e404
Small changes
Sep 28, 2023
4be33e8
small changes
Sep 28, 2023
1bff884
elaborated data sources in README for annotations prescored data
Oct 2, 2023
c2ffb57
added required repos to requirements
Oct 5, 2023
87a33ba
clarified that user has to actiavte deeprvat_annotations
Oct 5, 2023
e91ca3d
added multithreading for deepripe scores to snakemake file
Oct 6, 2023
ebca4f0
changed repo name
Oct 6, 2023
2d06c10
fixed input error on merge_deepripe rules
Oct 6, 2023
a5125a8
added per-gene option in vep. no filtering to gene-variant pair level…
Oct 6, 2023
c3347c1
added VEP plugin repo and faatpipe repo to config and setup.sh file
Oct 9, 2023
f3df8a8
bug fix
Oct 9, 2023
ae950cc
Update ensembl-vep version
endast Oct 11, 2023
df20e1b
Make use of modules optional
endast Oct 13, 2023
2c62beb
added merging of absplice ond deepsea pca to complete annotations in …
Oct 13, 2023
138958b
Merge branch 'deepripe-batch' of github.com:PMBio/deeprvat into deepr…
Oct 13, 2023
11c69a5
Fix processing of Uploaded_variation col
endast Oct 13, 2023
d9b303b
Merge branch 'deepripe-batch' of https://github.com/PMBio/deeprvat in…
endast Oct 13, 2023
81324a0
add vep to conda env
endast Oct 13, 2023
17580a3
Merge branch 'deepripe-batch' of github.com:PMBio/deeprvat into deepr…
endast Oct 13, 2023
15f50bc
Fix snakemake typos
endast Oct 13, 2023
3bb17c6
Remoce unused import
endast Oct 13, 2023
e29fcc0
Fix split of cols
endast Oct 15, 2023
7be1915
Only read the file in one place
endast Oct 15, 2023
132efc4
fixed path names in snakemake rules
Oct 16, 2023
7cdb6ab
Merge branch 'deepripe-batch' of github.com:PMBio/deeprvat into deepr…
Oct 16, 2023
22b979a
small changes
Oct 16, 2023
5995e8b
Use gz files
endast Oct 16, 2023
c9bb44b
Update annotations.snakefile
endast Oct 16, 2023
c4b452c
cast - as na in vep files
endast Oct 17, 2023
0599c89
pca object is saved and loaded as np array of components by default.
Oct 17, 2023
5affade
Merge branch 'deepripe-batch' of github.com:PMBio/deeprvat into deepr…
Oct 17, 2023
1e0ab68
Change vcf_file path to input_files
endast Oct 17, 2023
f1492b3
Update test data
endast Oct 18, 2023
3ff0ad1
Fix columntypes
endast Oct 18, 2023
c8006a7
Merge branch 'deepripe-batch' of github.com:PMBio/deeprvat into deepr…
endast Oct 18, 2023
e65e2c6
Fix typo in keyword argument
endast Oct 18, 2023
c7a21dc
Fix multiple merge problems from example run
endast Oct 18, 2023
66eb97c
Merge branch 'main' into deepripe-batch
endast Oct 18, 2023
c723aa0
fixup! Format Python code with psf/black pull_request
Oct 18, 2023
0c45bf4
Rename to pca_matrix
endast Oct 18, 2023
fb0f1e0
revert to components 100
endast Oct 18, 2023
a702e31
make deepsea_pca_n_components a config value
endast Oct 19, 2023
d93b920
add DeepRVAT-Annotation-Pipeline-Smoke-Tests
endast Oct 19, 2023
ee0f3bc
rename bcf_file_pattern to source_variant_file_pattern
endast Oct 19, 2023
8d1b0d5
Create pvcf_blocks.txt
endast Oct 19, 2023
ff9d3d0
Update github-actions.yml
endast Oct 19, 2023
5718952
Add test vcf data
endast Oct 19, 2023
bb612b3
_variants_header is gzipped
endast Oct 19, 2023
b4ac7ca
Fix extract variants
endast Oct 19, 2023
824b05e
Fix reference
endast Oct 19, 2023
e983493
Update github-actions.yml
endast Oct 19, 2023
a9bb870
Create hg38.fa
endast Oct 19, 2023
0e0fdae
Update github-actions.yml
endast Oct 19, 2023
d7d4f4f
Create variants.tsv.gz
endast Oct 19, 2023
1a0680d
Update github-actions.yml
endast Oct 19, 2023
bb0c60e
Create .gitkeep
endast Oct 19, 2023
dced52f
Update github-actions.yml
endast Oct 19, 2023
bac5dc1
cleanup
endast Oct 19, 2023
22a5442
Revert "cleanup"
endast Oct 19, 2023
1ad2b25
Update github-actions.yml
endast Oct 19, 2023
2a444c6
Restore all pipelines
endast Oct 19, 2023
c2b0226
Cleanup
endast Oct 19, 2023
51c7950
Snakefmt annotations.snakefile
endast Oct 19, 2023
fd2071c
Update README.md
endast Oct 20, 2023
9b04294
Refactor annotions.py
endast Oct 20, 2023
510cc74
fixup! Format Python code with psf/black pull_request
Oct 20, 2023
528927a
Refactor module load and add comments to conf
endast Oct 20, 2023
a2f19a1
Merge branch 'deepripe-batch' of https://github.com/PMBio/deeprvat in…
endast Oct 20, 2023
b3d22f1
remove resources
endast Oct 20, 2023
82e154d
remove comment
endast Oct 20, 2023
be45635
remove comment
endast Oct 20, 2023
fb58ed0
add link to absplice readme
endast Oct 20, 2023
c5be60d
add default chr
endast Oct 20, 2023
df354d6
snakefmt
endast Oct 20, 2023
820a8a2
Use chr21,chr22 in example
endast Oct 20, 2023
5444ffd
Update annotations.snakefile
endast Oct 20, 2023
569252b
Fix load cmd from config
endast Oct 20, 2023
5c4e2f4
Update deeprvat_annotation_config.yaml
endast Oct 20, 2023
14 changes: 14 additions & 0 deletions .github/workflows/github-actions.yml
@@ -84,6 +84,20 @@ jobs:
          args: '-j 2 -n --configfile pipelines/config/deeprvat_preprocess_config.yaml'
          stagein: 'touch example/preprocess/workdir/reference/GRCh38.primary_assembly.genome.fa'


  DeepRVAT-Annotation-Pipeline-Smoke-Tests:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v3
      - name: Annotations Smoke Test
        uses: snakemake/snakemake-github-action@v1.25.1
        with:
          directory: 'example/annotations'
          snakefile: 'pipelines/annotations.snakefile'
          args: '-j 2 -n --configfile pipelines/config/deeprvat_annotation_config.yaml'


  DeepRVAT-Preprocessing-Pipeline-Tests:
    runs-on: ubuntu-latest
    needs: DeepRVAT-Preprocessing-Pipeline-Smoke-Tests
111 changes: 88 additions & 23 deletions deeprvat/annotations/README.md
@@ -1,55 +1,120 @@
# DeepRVAT Annotation pipeline

This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools+samtools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [CADD](https://cadd.gs.washington.edu/) and [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). Future releases will include further annotation tools like [abSplice](https://github.com/gagneurlab/absplice), [deepSEA](http://deepsea.princeton.edu/job/analysis/create/) and [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/).
This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samtools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using the [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#1), DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#2), and abSplice scores were computed using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#3).

![dag](https://github.com/PMBio/deeprvat/assets/23211603/d483831e-3558-4e21-9845-4b62ad4eecc3)
*Figure 1: Example DAG of the annotation pipeline using only two bcf files as input.*

## Input

The pipeline uses compressed vcf files containing variant information, a reference fasta file as well as a text file that maps data blocks to chromosomes as input. It is expected that the vcf files contain the columns "CHROM" "POS" "ID" "REF" and "ALT". Any other columns, including genotype information are stripped from the data before annotation tools are used on the data. The variants may be split into several vcf files for each chromosome and each "block" of data. The filenames should then contain the corresponding chromosome and block number. The pattern of the file names, as well as file structure may be specified in the corresponding [config file](config/deeprvat_annotation_config.yaml).
The pipeline takes as input left-normalized bcf files containing variant information, a reference fasta file, and a text file that maps data blocks to chromosomes. The bcf files are expected to contain the columns "CHROM", "POS", "ID", "REF" and "ALT". Any other columns, including genotype information, are stripped from the data before the annotation tools are run. The variants may be split into several files, one for each chromosome and each "block" of data. The filenames should then contain the corresponding chromosome and block number. The pattern of the file names, as well as the file structure, may be specified in the corresponding [config file](config/deeprvat_annotation_config.yaml).
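
As a rough illustration of the chromosome/block naming convention described above, the following Python sketch groups input files by chromosome. The naming scheme `chr<N>_block<M>.bcf` and both function names are assumptions made for this example only; the pattern actually used is whatever is set in the config file.

```python
import re

# Hypothetical naming scheme (an assumption for this sketch): one file per
# chromosome and block, e.g. "chr21_block3.bcf". The real pattern is
# configured in deeprvat_annotation_config.yaml.
FILE_PATTERN = re.compile(r"^chr(?P<chrom>[0-9]+|X|Y)_block(?P<block>[0-9]+)\.bcf$")


def parse_variant_filename(name):
    """Return (chromosome, block) for a matching file name, else None."""
    match = FILE_PATTERN.match(name)
    if match is None:
        return None
    return match.group("chrom"), int(match.group("block"))


def group_by_chromosome(names):
    """Map each chromosome to the sorted list of block numbers found for it."""
    blocks = {}
    for name in names:
        parsed = parse_variant_filename(name)
        if parsed is None:
            continue  # skip files that do not follow the naming scheme
        chrom, block = parsed
        blocks.setdefault(chrom, []).append(block)
    return {chrom: sorted(found) for chrom, found in blocks.items()}
```

A check like this can help spot files whose names would not be picked up by the configured pattern before starting a long run.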

## Requirements
BCFtools and HTSlib should be installed on the machine. The following are installed by the pipeline, together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI:
- [CADD](https://github.com/kircherlab/CADD-scripts/tree/master/src/scripts)
- [VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html)
- [absplice](https://github.com/gagneurlab/absplice/tree/master)
- [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)
- [faatpipe](https://github.com/HealthML/faatpipe)
- [vep-plugins repository](https://github.com/Ensembl/VEP_plugins/)

Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](config/deeprvat_annotation_config.yaml).
Download paths:
- [CADD](http://cadd.gs.washington.edu/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices
- [SpliceAI](https://basespace.illumina.com/s/otSPW8hnhaZR): "genome_scores_v1.3"/"spliceai_scores.raw.snv.hg38.vcf.gz" and "spliceai_scores.raw.indel.hg38.vcf.gz"
- [PrimateAI](https://basespace.illumina.com/s/yYGFdGih1rXL): PrimateAI supplementary data/"PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz"

[CADD](https://github.com/kircherlab/CADD-scripts/tree/master/src/scripts) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#docker) should be installed together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for VEP, CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](config/deeprvat_annotation_config.yaml).

## Output

The pipeline outputs one annotation file for VEP and one annotation file for CADD for each input vcf-file. Further releases will concatenate and merge the output data into one file.
The pipeline outputs one annotation file each for VEP, CADD, DeepRiPe, DeepSea and AbSplice for each input vcf-file. It further creates concatenated files for each tool and one merged file containing scores from AbSplice, VEP incl. CADD, primateAI and spliceAI, as well as principal components from DeepSea and DeepRiPe.
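
To make the idea of a single merged file more concrete, here is a minimal Python sketch of combining per-tool annotation tables into one row per variant. It is not the pipeline's implementation: the function name, the `(chrom, pos, ref, alt)` key, and the column-prefixing scheme are all assumptions chosen for illustration.

```python
def merge_annotations(per_tool):
    """Merge {tool: {variant_key: {column: value}}} into one row per variant.

    Assumptions of this sketch: variants are keyed by a
    (chrom, pos, ref, alt) tuple, and output columns are named
    "<tool>_<column>". Variants a tool did not score get None.
    """
    # Collect every prefixed column name, keeping first-seen order.
    columns = []
    for tool, table in per_tool.items():
        for row in table.values():
            for col in row:
                name = f"{tool}_{col}"
                if name not in columns:
                    columns.append(name)

    # The union of all variant keys defines the rows of the merged table.
    variants = sorted({key for table in per_tool.values() for key in table})

    merged = {}
    for key in variants:
        merged[key] = {name: None for name in columns}
        for tool, table in per_tool.items():
            for col, value in table.get(key, {}).items():
                merged[key][f"{tool}_{col}"] = value
    return merged
```

Filling missing entries with `None` keeps every variant in the merged table even when a tool produced no score for it, which corresponds to an outer join on the variant key.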

## Configure the annotation pipeline
The snakemake annotation pipeline is configured using a yaml file in a format akin to the [example file](config/deeprvat_annotation_config.yaml).

The config above would use the following directory structure:
```shell
parent_directory

|-- reference
| |-- GRCh38.primary_assembly.genome.fa.gz
|-- vcf
| |-- metadata
| | |-- pvcf_blocks.txt
| |-- raw
|-- annotations
| |-- tmp
| |-- fasta file


|-- metadata
| |-- pvcf_blocks.txt

|-- preprocessing_workdir
| |--reference
| | |-- fasta file
| |-- norm
| | |-- bcf
| | | |-- bcf_input_files
| | | |-- ...
| | |-- variants
| | | |-- variants.tsv.gz

|-- output_dir
| |-- annotations
| | |-- tmp

|-- repo_dir
| |-- ensembl-vep
| | |-- cache
| | |-- plugins
| |-- abSplice
| |-- faatpipe
| |-- kipoi-veff2

|-- annotation_data
| |-- cadd
| |-- spliceAI
| |-- primateAI
|-- software
| |-- ensembl-vep
| | |-- cache
| | |-- plugins
| |-- CADD-scripts



```
The variant input files are then stored in the `vcf/raw` directory, the reference fasta file is stored in the `reference` folder. The text file mapping blocks to chromosomes is stored in the `vcf/metadata` folder. The output is stored in the `annotations` folder and any temporary files in the `tmp` subfolder. VEP with its corresponding cache as well as scripts for CADD are stored in `software`.

Bcf files created by the [preprocessing pipeline](https://github.com/PMBio/deeprvat/blob/Annotations/deeprvat/preprocessing/README.md) are used as input data.
The pipeline also uses the `variants.tsv.gz` file as well as the reference file from the preprocessing pipeline.
The pipeline begins by installing the repositories needed for the annotations. It automatically installs all repositories in the `repo_dir` folder, which can be specified in the config file relative to the annotation working directory.
The text file mapping blocks to chromosomes is stored in the `metadata` folder. The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used, including VEP with its corresponding cache and plugins, are stored in `repo_dir`; VEP itself is located in `repo_dir/ensembl-vep`.
Data for VEP plugins and the CADD cache are stored in `annotation_data`.

## Running the annotation pipeline
### Preconfiguration
- Inside the annotation directory create a directory `repo_dir` and run the [annotation setup script](setup_annotation_workflow.sh)
```shell
setup_annotation_workflow.sh repo_dir/ensembl-vep/cache repo_dir/ensembl-vep/Plugins repo_dir
```
or manually clone the repositories mentioned in the [requirements](#requirements) into `repo_dir` and install the needed conda environments with
```shell
mamba env create -f repo_dir/absplice/environment.yaml
mamba env create -f repo_dir/kipoi-veff2/environment.minimal.linux.yml
mamba env create -f deeprvat/deeprvat_annotations.yml
```
If you already have some of the needed repositories on your machine you can edit the paths in the [config](../../pipelines/config/deeprvat_annotation_config.yaml).


After configuration and activating the environment run the pipeline using snakemake:
- Inside the annotation directory create a directory `annotation_dir` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements))


### Running the pipeline
After configuration and activating the `deeprvat_annotations` environment run the pipeline using snakemake:

```shell
snakemake -j <nr_cores> -s annotations.snakemake --configfile config/deeprvat_annotation.config
snakemake -j <nr_cores> -s annotations.snakemake --configfile config/deeprvat_annotation.config --use-conda
```
## Running the annotation pipeline without the preprocessing pipeline

It is possible to run the annotation pipeline without having run the preprocessing pipeline first.
However, the annotation pipeline requires some files from the preprocessing pipeline, which then have to be created manually:
- Left-normalized bcf files from the input. These files do not have to contain any genotype information; the "chrom", "pos", "ref" and "alt" columns will suffice.
- A reference fasta file.
- A tab-separated file containing the "chrom", "pos", "ref" and "alt" entries of all input variants, each with a unique id.
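
The tab-separated variant file from the last bullet could be produced along the following lines. This is only a sketch: the function name, the header names, the column order, and the use of a running integer as the unique id are assumptions, so the layout should be checked against a `variants.tsv.gz` produced by the preprocessing pipeline before relying on it.

```python
import csv
import io


def write_variants_tsv(variants, handle):
    """Write unique (chrom, pos, ref, alt) rows with a running integer id.

    The header names and column order used here are assumptions of this
    sketch, not a documented file format. Returns the number of rows written.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "chrom", "pos", "ref", "alt"])
    seen = set()
    next_id = 0
    for variant in variants:
        if variant in seen:
            continue  # each variant must appear once, with a single id
        seen.add(variant)
        writer.writerow([next_id, *variant])
        next_id += 1
    return next_id


# Usage with made-up variants, writing to an in-memory buffer.
buffer = io.StringIO()
write_variants_tsv([("chr21", 100, "A", "G"), ("chr22", 5, "C", "T")], buffer)
```

For real data the handle would typically be a file opened with `gzip.open(path, "wt")`, since the directory layout above lists the file as `variants.tsv.gz`.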


## References
<a id="1">[1]</a> Monti, R., Rautenstrauch, P., Ghanbari, M. et al. Identifying interpretable gene-biomarker associations with functionally informed kernel-based tests in 190,000 exomes. Nat Commun 13, 5332 (2022). https://doi.org/10.1038/s41467-022-32864-2

<a id="2">[2]</a> Žiga Avsec et al., “Kipoi: accelerating the community exchange and reuse of predictive models for genomics,” bioRxiv, p. 375345, Jan. 2018, doi: 10.1101/375345.

## Next Releases
Further releases will include further annotation tools like [abSplice](https://github.com/gagneurlab/absplice), [deepSEA](http://deepsea.princeton.edu/job/analysis/create/) and [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/).
Furthermore, annotations will be concatenated and merged into a single file containing annotations from every tool used on every input variant. Support for gene specificity of variants will be also be included in coming releases (e.g. some variants may have several annotations for each gene they are mapped to).
<a id="3">[3]</a> N. Wagner et al., “Aberrant splicing prediction across human tissues,” Nature Genetics, vol. 55, no. 5, pp. 861–870, May 2023, doi: 10.1038/s41588-023-01373-3.