Annotation speedups #53

Closed
wants to merge 80 commits into from
Commits
efd46a8
added support for deepRiPe and deepSea, ability to concat and merge a…
Aug 4, 2023
1545c49
changed config file
Aug 4, 2023
44702f9
Create deeprvat_annotations.yml
endast Aug 4, 2023
a5fed75
Update deeprvat_annotations.yml
endast Aug 4, 2023
bb66bce
cleaned up commented out code in annotations.py
Aug 4, 2023
f0e1b86
Merge branch 'Annotations' of github.com:PMBio/deeprvat into Annotations
Aug 4, 2023
98aad4e
Update deeprvat_annotations.yml
endast Aug 7, 2023
a012f2c
Update deeprvat_annotations.yml
endast Aug 7, 2023
b58f654
Update deeprvat_annotations.yml
endast Aug 7, 2023
d6ef2f5
Use python=3.9.16
endast Aug 7, 2023
74dd04e
Update deeprvat_annotations.yml
endast Aug 7, 2023
cfe68ca
increased memory resource for rule concat_deepSea
Aug 7, 2023
deec6b7
Merge branch 'main' into Annotations
Aug 7, 2023
6e72396
Merge branch 'Annotations' of github.com:PMBio/deeprvat into Annotations
Aug 7, 2023
5091be0
Update deeprvat_annotations.yml
endast Aug 8, 2023
d8806e2
deleted obsolete environment yaml file; removed absolute path from an…
Aug 8, 2023
b327392
Merge branch 'Annotations' of github.com:PMBio/deeprvat into Annotations
Aug 8, 2023
03218c8
updated environment, config, removed dask dependency
Aug 14, 2023
3aac39a
changed annotations.py to make concatenation
Aug 23, 2023
c987400
added support for different deepRIPE models (hg2, k5 and parclip)
Aug 24, 2023
9a8391c
fixup! Format Python code with psf/black pull_request
Aug 24, 2023
44acb9e
changed some temporary paths
Aug 24, 2023
33a5b78
filtering and aggregating vep columns
Aug 25, 2023
e18412c
fixup! Format Python code with psf/black pull_request
Aug 25, 2023
5d2c32c
Update README.md
Marcel-Mueck Aug 25, 2023
5f9a2bc
Update README.md
Marcel-Mueck Aug 25, 2023
16206dc
Update README.md
Marcel-Mueck Aug 25, 2023
f707b9e
Update README.md
Marcel-Mueck Aug 25, 2023
5d65cdf
speedups using joblib and batching
bfclarke Sep 12, 2023
1fec6f2
remove unused function
bfclarke Sep 13, 2023
530da1f
n_jobs as parameter
Sep 14, 2023
a85e404
Small changes
Sep 28, 2023
4be33e8
small changes
Sep 28, 2023
1bff884
elaborated data sources in README for annotations prescored data
Oct 2, 2023
c2ffb57
added required repos to requirements
Oct 5, 2023
87a33ba
clarified that user has to actiavte deeprvat_annotations
Oct 5, 2023
e91ca3d
added multithreading for deepripe scores to snakemake file
Oct 6, 2023
ebca4f0
changed repo name
Oct 6, 2023
2d06c10
fixed input error on merge_deepripe rules
Oct 6, 2023
a5125a8
added per-gene option in vep. no filtering to gene-variant pair level…
Oct 6, 2023
c3347c1
added VEP plugin repo and faatpipe repo to config and setup.sh file
Oct 9, 2023
f3df8a8
bug fix
Oct 9, 2023
ae950cc
Update ensembl-vep version
endast Oct 11, 2023
df20e1b
Make use of modules optional
endast Oct 13, 2023
2c62beb
added merging of absplice ond deepsea pca to complete annotations in …
Oct 13, 2023
138958b
Merge branch 'deepripe-batch' of github.com:PMBio/deeprvat into deepr…
Oct 13, 2023
11c69a5
Fix processing of Uploaded_variation col
endast Oct 13, 2023
d9b303b
Merge branch 'deepripe-batch' of https://github.com/PMBio/deeprvat in…
endast Oct 13, 2023
81324a0
add vep to conda env
endast Oct 13, 2023
17580a3
Merge branch 'deepripe-batch' of github.com:PMBio/deeprvat into deepr…
endast Oct 13, 2023
15f50bc
Fix snakemake typos
endast Oct 13, 2023
3bb17c6
Remoce unused import
endast Oct 13, 2023
e29fcc0
Fix split of cols
endast Oct 15, 2023
7be1915
Only read the file in one place
endast Oct 15, 2023
132efc4
fixed path names in snakemake rules
Oct 16, 2023
7cdb6ab
Merge branch 'deepripe-batch' of github.com:PMBio/deeprvat into deepr…
Oct 16, 2023
22b979a
small changes
Oct 16, 2023
5995e8b
Use gz files
endast Oct 16, 2023
c9bb44b
Update annotations.snakefile
endast Oct 16, 2023
c4b452c
cast - as na in vep files
endast Oct 17, 2023
0599c89
pca object is saved and loaded as np array of components by default.
Oct 17, 2023
5affade
Merge branch 'deepripe-batch' of github.com:PMBio/deeprvat into deepr…
Oct 17, 2023
48944dd
parallelised deepripe merge and use parquet instead of csv for pca
Oct 25, 2023
51a59d9
use parquet files for pca input
Oct 25, 2023
d7f3d1d
added docstrings to each function
Nov 22, 2023
f3e3ee2
changed vep cmd to include only scores (not class label )for PolyPhen…
Jan 29, 2024
6225278
removed redundant vep_repo_path from config
Jan 29, 2024
8ba9c47
Merge branch 'main' into annotation-speedups
Feb 2, 2024
4a47768
added final steps of the annotation processing: calculating af, maf a…
Feb 13, 2024
6047dd5
Create empty gencode.v44.annotation.gtf.gz
Marcel-Mueck Feb 14, 2024
c5a904a
Create genotypes.h5
Marcel-Mueck Feb 14, 2024
06f0448
Update deeprvat_annotation_config.yaml
Marcel-Mueck Feb 14, 2024
763cc54
removed commented out lines
Marcel-Mueck Feb 14, 2024
5a1c8dc
changed path for primateAI File in annotation config
Marcel-Mueck Feb 14, 2024
2602240
Removed commented out lines in annotation pipeline and source
Marcel-Mueck Feb 14, 2024
1cffa6c
removed duplicated 'merge_deepsea_pcas' function
Marcel-Mueck Feb 14, 2024
8f3560e
updated annotation rulegraph
Marcel-Mueck Feb 15, 2024
cd217f5
Update deeprvat_annotation_config.yaml
Marcel-Mueck Feb 20, 2024
d471a0b
Update absplice_download_Snakefile
Marcel-Mueck Feb 20, 2024
3a05645
fixup! Format Python code with psf/black pull_request
Feb 20, 2024
1,546 changes: 1,353 additions & 193 deletions deeprvat/annotations/annotations.py

Large diffs are not rendered by default.

8 changes: 5 additions & 3 deletions deeprvat/deeprvat/evaluate.py
@@ -162,9 +162,11 @@ def get_pvals(results, method_mapping=None, phenotype_mapping={}):
 
     if phenotype_mapping is not None:
         pvals["phenotype"] = pvals["phenotype"].apply(
-            lambda x: phenotype_mapping[x]
-            if x in phenotype_mapping
-            else " ".join(x.split("_"))
+            lambda x: (
+                phenotype_mapping[x]
+                if x in phenotype_mapping
+                else " ".join(x.split("_"))
+            )
         )
 
     return pvals
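Note on the `evaluate.py` hunk: the change only wraps the conditional expression in parentheses, so behavior should be unchanged. A minimal sketch of the fallback logic, using plain strings instead of the `pvals` DataFrame; the phenotype names and the mapping below are made up for illustration:

```python
# Hypothetical mapping and phenotype names, for illustration only.
phenotype_mapping = {"LDL_direct": "LDL"}
phenotypes = ["LDL_direct", "Body_mass_index"]

# Formatting before the change: conditional expression without parentheses.
old_style = lambda x: phenotype_mapping[x] if x in phenotype_mapping else " ".join(x.split("_"))

# Formatting after the change: same expression, wrapped in parentheses.
new_style = lambda x: (
    phenotype_mapping[x] if x in phenotype_mapping else " ".join(x.split("_"))
)

# Mapped names are replaced; unmapped names get underscores turned into spaces.
labels = [new_style(p) for p in phenotypes]
print(labels)  # ['LDL', 'Body mass index']
```

Both lambdas return the same value for every input, which is what one would expect from a formatter-only change.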
12 changes: 6 additions & 6 deletions deeprvat/seed_gene_discovery/seed_gene_discovery.py
@@ -495,9 +495,9 @@ def update_config(
     if variant_type is not None:
         logger.info(f"Variant type is {variant_type}")
         if variant_type == "missense":
-            rare_threshold_config[
-                "Consequence_missense_variant"
-            ] = "Consequence_missense_variant == 1"
+            rare_threshold_config["Consequence_missense_variant"] = (
+                "Consequence_missense_variant == 1"
+            )
         elif variant_type == "plof":
             rare_threshold_config["is_plof"] = "is_plof == 1"
         elif variant_type == "all":
@@ -517,9 +517,9 @@ def update_config(
     if rare_maf is not None:
         logger.info(f"setting association testing maf to {rare_maf}")
         config["data"]["dataset_config"]["min_common_af"][maf_column] = rare_maf
-        rare_threshold_config[
-            maf_column
-        ] = f"{maf_column} < {rare_maf} and {maf_column} > 0"
+        rare_threshold_config[maf_column] = (
+            f"{maf_column} < {rare_maf} and {maf_column} > 0"
+        )
 
     logger.info(f"Rare variant thresholds: {rare_threshold_config}")
     with open(new_config_file, "w") as f:
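Note on the `seed_gene_discovery.py` hunks: again only the assignment formatting changes. The sketch below rebuilds the threshold dictionary from the two visible hunks as a standalone function. The function name `build_rare_thresholds` and the column name `"MAF"` are assumptions made for this example, and the `"all"` branch is left out because its body is not shown in the diff:

```python
def build_rare_thresholds(variant_type=None, maf_column="MAF", rare_maf=None):
    """Standalone sketch of the filter strings assembled in update_config.

    Hypothetical helper: the real code fills `rare_threshold_config` in place.
    """
    thresholds = {}
    if variant_type == "missense":
        thresholds["Consequence_missense_variant"] = (
            "Consequence_missense_variant == 1"
        )
    elif variant_type == "plof":
        thresholds["is_plof"] = "is_plof == 1"
    # The "all" branch exists in the source but its body is elided in the diff.

    if rare_maf is not None:
        # Keep variants below the MAF threshold but with non-zero frequency.
        thresholds[maf_column] = f"{maf_column} < {rare_maf} and {maf_column} > 0"
    return thresholds


thresholds = build_rare_thresholds("missense", maf_column="MAF", rare_maf=0.001)
for column, expression in thresholds.items():
    print(f"{column}: {expression}")
```

Each value is a plain filter expression string keyed by annotation column, so the parenthesized form introduced here produces exactly the same strings as the old bracket-split form.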
3 changes: 2 additions & 1 deletion deeprvat_annotations.yml
@@ -19,4 +19,5 @@ dependencies:
   #comment out lines below if you want to use preinstalled bcftools or samtools
   - bcftools=1.17
   - samtools=1.17
-  - ensembl-vep=110.1
+  - ensembl-vep=110.1
+  - pyranges
273 changes: 273 additions & 0 deletions docs/_static/annotation_rulegraph.svg
28 changes: 17 additions & 11 deletions docs/annotations.md
@@ -1,15 +1,15 @@
# DeepRVAT Annotation pipeline

This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samstools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computet using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target)
This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samstools](https://www.htslib.org/), [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computed using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target)

![dag](_static/annotation_pipeline_dag.png)
![dag](_static/annotation_rulegraph.svg)
*Figure 1: Example DAG of annoation pipeline using only two bcf files as input.*

## Input

The pipeline uses left-normalized bcf files containing variant information, a reference fasta file as well as a text file that maps data blocks to chromosomes as input. It is expected that the bcf files contain the columns "CHROM" "POS" "ID" "REF" and "ALT". Any other columns, including genotype information are stripped from the data before annotation tools are used on the data. The variants may be split into several vcf files for each chromosome and each "block" of data. The filenames should then contain the corresponding chromosome and block number. The pattern of the file names, as well as file structure may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).

(requirements-target)=

## Requirements

BCFtools as well as HTSlib should be installed on the machine,
@@ -39,6 +39,7 @@ The config above would use the following directory structure:

|-- reference
| |-- fasta file
| |-- gtf file


|-- metadata
@@ -70,33 +71,38 @@ The config above would use the following directory structure:
| |-- cadd
| |-- spliceAI
| |-- primateAI

| |-- AlphaMissense


```

Bcf files created by the [preprocessing pipeline](preprocessing.md) are used as input data.
The pipeline also uses the variant.tsv file as well as the reference file from the preprocesing pipeline.
The pipeline beginns by installing the repositories needed for the annotations, it will automatically install all repositories in the `repo_dir` folder that can be specified in the config file relative to the annotation working directory.
Alternatively, vcf files may be used as input data as well.
The pipeline also uses the variant.tsv file as well as the genotype file from the preprocesing pipeline.
The text file mapping blocks to chromosomes is stored in `metadata` folder. The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used including VEP with its corresponding cache as well as plugins are stored in `repo_dir/ensempl-vep`.
Data for VEP plugins and the CADD cache are stored in `annotation data`.
Data for VEP plugins and the CADD cache may be stored in `annotation data`.

## Running the annotation pipeline
### Preconfiguration
- Inside the annotation directory create a directory `repo_dir` and run the [annotation setup script](https://github.com/PMBio/deeprvat/blob/main/deeprvat/annotations/setup_annotation_workflow.sh)
```shell
setup_annotation_workflow.sh repo_dir/ensembl-vep/cache repo_dir/ensembl-vep/Plugins repo_dir
```
or manually clone the repositories mentioned in the [requirements](#requirements-target) into `repo_dir` and install the needed conda environments with
or manually clone the repositories mentioned in the [requirements](#requirements) into `repo_dir` and install the needed conda environments with
```shell
mamba env create -f repo_dir/absplice/environment.yaml
mamba activate absplice
pip install -e repo_dir/absplice
mamba env create -f repo_dir/kipoi-veff2/environment.minimal.linux.yml
mamba activate kipoi-veff2
pip install -e repo_dir/kipoi-veff2
mamba env create -f deeprvat/deeprvat_annotations.yml
mamba activate deeprvat_annotations
```
If you already have some of the needed repositories on your machine you can edit the paths in the [config](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).
If you already have some of the needed repositories on your machine you can edit the paths in the [config](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml). Or link the repositories into repo_dir.


- Inside the annotation directory create a directory `annotation_dir` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements-target))
- Inside the annotation directory create a directory `annotation_dir` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements))


### Running the pipeline
@@ -109,7 +115,7 @@ After configuration and activating the `deeprvat_annotations` environment run th

It is possible to run the annotation pipeline without having run the preprocessing prior to that.
However, the annotation pipeline requires some files from this pipeline that then have to be created manually.
- Left normalized bcf files from the input. These files do not have to contain any genotype information. "chrom, "pos", "ref" and "alt" columns will suffice.
- Left normalized bcf or vcf files from the input. These files do not have to contain any genotype information. "chrom, "pos", "ref" and "alt" columns will suffice.
- a reference fasta file will have to be provided
- A tab separated file containing all input variants "chrom, "pos", "ref" and "alt" entries each with a unique id.
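
Note on the last bullet: a minimal sketch of how such a tab-separated variant file could be produced by hand. The column order, the `id` header name, and the zero-based integer ids are assumptions for this example; the docs only state that each variant needs "chrom", "pos", "ref", "alt" and a unique id.

```python
import csv
import io


def write_variants_tsv(variants, handle):
    """Write unique (chrom, pos, ref, alt) tuples with a running integer id.

    Hypothetical helper; header names and id scheme are assumptions.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "chrom", "pos", "ref", "alt"])
    ids = {}
    for variant in variants:
        if variant not in ids:  # skip duplicates so every id is unique
            ids[variant] = len(ids)
            writer.writerow([ids[variant], *variant])
    return ids


# Made-up example variants; the third entry duplicates the first.
variants = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 1000, "A", "T"),
    ("chr1", 1000, "A", "G"),
]
buffer = io.StringIO()
ids = write_variants_tsv(variants, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

In practice the handle would be an open file rather than an in-memory buffer, and the variants would come from the left-normalized bcf or vcf files.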

@@ -0,0 +1 @@

@@ -0,0 +1 @@