Merge branch 'main' into annotations_vep_af
Marcel-Mueck authored Aug 7, 2024
2 parents 219918f + 5e760de commit daa55f9
Showing 4 changed files with 15 additions and 17 deletions.
8 changes: 5 additions & 3 deletions LICENSE
@@ -1,7 +1,8 @@
MIT License
The source code of DeepRVAT is under the MIT license. The pre-trained DeepRVAT models are under the CC BY NC 4.0 license for academic and non-commercial use. This is because DeepRVAT makes use of SpliceAI and PrimateAI scores, which are currently under the CC BY NC 4.0 license by Illumina.

Copyright (c) 2022, Eva Holtkamp, Brian Clarke, Hakime Öztürk, Felix Brechtmann,
Florian Hölzlwimmer, Julien Gagneur, Oliver Stegle
## MIT License

Copyright (c) 2022, Brian Clarke, Eva Holtkamp, Hakime Öztürk, Marcel Mück, Magnus Wahlberg, Kayla Meyer, Felix Munzlinger, Felix Brechtmann, Florian R. Hölzlwimmer, Jonas Lindner, Zhifen Chen, Julien Gagneur, Oliver Stegle

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -20,3 +21,4 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

18 changes: 6 additions & 12 deletions docs/annotations.md
@@ -13,21 +13,16 @@ Furthermore, the pipeline outputs one annotation file for VEP, CADD, DeepRiPe, D

## Input

The pipeline uses left-normalized bcf files containing variant information, a reference fasta file as well as a text file that maps data blocks to chromosomes as input. It is expected that the bcf files contain the columns "CHROM" "POS" "ID" "REF" and "ALT".
Any other columns, including genotype information are stripped from the data before annotation tools are used on the data. The variants may be split into several vcf files for each chromosome and each "block" of data.

The pipeline uses left-normalized bcf files containing variant information (e.g. the bcf files created by the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html)), a reference fasta file, and a gtf file for gene information. It is expected that the bcf files contain the columns "CHROM" "POS" "ID" "REF" and "ALT".
Any other columns, including genotype information, are stripped from the data before annotation tools are used on the data. The variants may be split into several vcf files for each chromosome and each "block" of data.
The filenames should then contain the corresponding chromosome and block number. The pattern of the file names, as well as the file structure, may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml). The pipeline also requires the input data and repositories described in [requirements](#requirements).
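
As a minimal sketch of what the required variant columns look like in practice (assuming bcftools is installed; `chr1_block0.bcf` is only a hypothetical file name following the chromosome/block naming pattern):

```shell
# Sketch only: print the columns the annotation pipeline expects from an input file.
# chr1_block0.bcf is a placeholder name illustrating the chromosome/block pattern.
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\n' chr1_block0.bcf | head
```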

(requirements)=
## Requirements

BCFtools as well as HTSlib should be installed on the machine,
- [CADD](https://github.com/kircherlab/CADD-scripts/tree/master/src/scripts) as well as
- [VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html),
- [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)
- [faatpipe](https://github.com/HealthML/faatpipe), and the
- [vep-plugins repository](https://github.com/Ensembl/VEP_plugins/)
BCFtools as well as HTSlib should be installed on the machine, and [VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html) should be installed for running the pipeline. The [faatpipe](https://github.com/HealthML/faatpipe) repo, the [kipoi-veff2](https://github.com/kipoi/kipoi-veff2) repo, and the [vep-plugins repository](https://github.com/Ensembl/VEP_plugins/) should be cloned. Annotation data for CADD, SpliceAI, and PrimateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml).

should be installed for running the pipeline, together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml).
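
A sketch of the cloning step, assuming git is available; the target directory `repo_dir` is a placeholder that should match the repository path used later in the setup section:

```shell
# Sketch only: clone the required repositories into a common directory.
# repo_dir is a placeholder; point the annotation config at the paths you actually use.
mkdir -p repo_dir && cd repo_dir
git clone https://github.com/HealthML/faatpipe.git
git clone https://github.com/kipoi/kipoi-veff2.git
git clone https://github.com/Ensembl/VEP_plugins.git
```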
Download paths:
- [CADD](https://cadd.bihealth.org/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices
- [SpliceAI](https://basespace.illumina.com/s/otSPW8hnhaZR): "genome_scores_v1.3"/"spliceai_scores.raw.snv.hg38.vcf.gz" and "spliceai_scores.raw.indel.hg38.vcf.gz"
@@ -95,15 +90,14 @@ Data for VEP plugins and the CADD cache are stored in `annotation data`.
mamba activate deeprvat_annotations
pip install -e path/to/deeprvat
```
- Clone the repositories mentioned in [requirements](#requirements) into `repo_dir` and install the needed conda environments with
- Clone the repositories mentioned in [requirements](#requirements) into `repo_dir` and install the required kipoi-veff2 conda environment with
```shell
mamba env create -f repo_dir/kipoi-veff2/environment.minimal.linux.yml
mamba env create -f deeprvat/deeprvat_annotations.yml
```
If you already have some of the needed repositories on your machine, you can edit the paths in the [config](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml).


- Inside the annotation directory create a directory `annotation_dir` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements))
- Inside the annotation directory create a directory `annotation_data` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements))
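
A possible way to stage these files, shown as a sketch with placeholder source paths (the SpliceAI and gnomAD indel file names follow the downloads listed under [requirements](#requirements)):

```shell
# Sketch with placeholder paths: link pre-downloaded score files into annotation_data.
mkdir -p annotation_data
ln -s /path/to/downloads/spliceai_scores.raw.snv.hg38.vcf.gz annotation_data/
ln -s /path/to/downloads/spliceai_scores.raw.indel.hg38.vcf.gz annotation_data/
ln -s /path/to/downloads/gnomad.genomes.r3.0.indel.tsv.gz annotation_data/
```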


### Running the pipeline
2 changes: 1 addition & 1 deletion docs/preprocessing.md
@@ -198,7 +198,7 @@ gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
4. Run with the example config

```shell
snakemake -j 1 --snakefile ../../pipelines/preprocess_no_qc.snakefile --configfile ../../pipelines/config/deeprvat_preprocess_config.yaml
snakemake -j 1 --snakefile ../../pipelines/preprocess_no_qc.snakefile --configfile ../../example/config/deeprvat_preprocess_config.yaml
```

5. Enjoy the preprocessed data 🎉
4 changes: 3 additions & 1 deletion docs/pretrained_models.md
@@ -7,8 +7,10 @@ For using the pretrained DeepRVAT model provided as part of the package, or a cu

Configuration parameters must be specified in `deeprvat_input_pretrained_models_config.yaml`; see the [example file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_input_pretrained_models_config.yaml). For details on the meanings of the parameters and the format of input files, see [here](input_data).


To use pretrained models, you must specify `use_pretrained_models: True` in your `deeprvat_input_pretrained_models_config.yaml` configuration file. Additionally, provide the path to the pretrained models (an output of the training pipeline) in the parameter `pretrained_model_path`. The `pretrained_model_path` directory must contain a `config.yaml` file with the keys that were used for training the pretrained models: `rare_variant_annotations`, `training_data_thresholds`, and `model`. See the [example file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_input_pretrained_models_config.yaml).
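
As an illustrative sanity check of this layout (the directory name `pretrained_models` is a placeholder for your `pretrained_model_path`):

```shell
# Illustrative check only: the pretrained model directory must contain a
# config.yaml that defines the keys named above.
ls pretrained_models/config.yaml
grep -E '(rare_variant_annotations|training_data_thresholds|model):' pretrained_models/config.yaml
```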


The configuration parameters specified in `deeprvat_input_pretrained_models_config.yaml` are outlined below.

The following parameters specify the locations of required input files:
@@ -33,7 +35,7 @@ association_testing_data_thresholds (optional)
cv_options (optional)
```

Note that the file specified by `annotation_filename` must contain a column corresponding to each annotation in the list `rare_variant_annotations` from `deeprvat/pretrained_models/config.yaml`.
Note that the file specified by `annotation_filename` must contain a column corresponding to each annotation in the list `rare_variant_annotations` from `deeprvat/pretrained_models/model_config.yaml`.


## Executing the pipeline
