Merge branch 'main' into annotations_vep_af
Marcel-Mueck authored Aug 7, 2024
2 parents 219918f + 5e760de commit daa55f9
Showing 4 changed files with 15 additions and 17 deletions.
8 changes: 5 additions & 3 deletions LICENSE
@@ -1,7 +1,8 @@
MIT License
The source code of DeepRVAT is under the MIT license. The pre-trained DeepRVAT models are under the CC BY NC 4.0 license for academic and non-commercial use. This is because DeepRVAT makes use of SpliceAI and PrimateAI scores, which are currently under the CC BY NC 4.0 license by Illumina.

Copyright (c) 2022, Eva Holtkamp, Brian Clarke, Hakime Öztürk, Felix Brechtmann,
Florian Hölzlwimmer, Julien Gagneur, Oliver Stegle
## MIT License

Copyright (c) 2022, Brian Clarke, Eva Holtkamp, Hakime Öztürk, Marcel Mück, Magnus Wahlberg, Kayla Meyer, Felix Munzlinger, Felix Brechtmann, Florian R. Hölzlwimmer, Jonas Lindner, Zhifen Chen, Julien Gagneur, Oliver Stegle

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -20,3 +21,4 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

18 changes: 6 additions & 12 deletions docs/annotations.md
@@ -13,21 +13,16 @@ Furthermore, the pipeline outputs one annotation file for VEP, CADD, DeepRiPe, D

## Input

The pipeline uses left-normalized bcf files containing variant information, a reference fasta file as well as a text file that maps data blocks to chromosomes as input. It is expected that the bcf files contain the columns "CHROM" "POS" "ID" "REF" and "ALT".
Any other columns, including genotype information are stripped from the data before annotation tools are used on the data. The variants may be split into several vcf files for each chromosome and each "block" of data.

The pipeline uses left-normalized bcf files containing variant information (e.g. the bcf files created by the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html)), a reference fasta file, and a gtf file for gene information. It is expected that the bcf files contain the columns "CHROM" "POS" "ID" "REF" and "ALT".
Any other columns, including genotype information, are stripped from the data before annotation tools are used on the data. The variants may be split into several vcf files for each chromosome and each "block" of data.
The filenames should then contain the corresponding chromosome and block number. The pattern of the file names, as well as the file structure, may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml). The pipeline also requires the input data and repositories described in [requirements](#requirements).
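
As a minimal sketch of what the required variant columns look like in practice (assuming bcftools is installed; `chr1_block0.bcf` is only a hypothetical file name following the chromosome/block naming pattern):

```shell
# Sketch only: print the columns the annotation pipeline expects from an input file.
# chr1_block0.bcf is a placeholder name illustrating the chromosome/block pattern.
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\n' chr1_block0.bcf | head
```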

(requirements)=
## Requirements

BCFtools as well as HTSlib should be installed on the machine,
- [CADD](https://github.com/kircherlab/CADD-scripts/tree/master/src/scripts) as well as
- [VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html),
- [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)
- [faatpipe](https://github.com/HealthML/faatpipe), and the
- [vep-plugins repository](https://github.com/Ensembl/VEP_plugins/)
BCFtools as well as HTSlib should be installed on the machine, and [VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html) should be installed for running the pipeline. The [faatpipe](https://github.com/HealthML/faatpipe) repo, the [kipoi-veff2](https://github.com/kipoi/kipoi-veff2) repo, and the [vep-plugins repository](https://github.com/Ensembl/VEP_plugins/) should be cloned. Annotation data for CADD, SpliceAI, and PrimateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml).

should be installed for running the pipeline, together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml).
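
A sketch of the cloning step, assuming git is available; the target directory `repo_dir` is a placeholder that should match the repository path used later in the setup section:

```shell
# Sketch only: clone the required repositories into a common directory.
# repo_dir is a placeholder; point the annotation config at the paths you actually use.
mkdir -p repo_dir && cd repo_dir
git clone https://github.com/HealthML/faatpipe.git
git clone https://github.com/kipoi/kipoi-veff2.git
git clone https://github.com/Ensembl/VEP_plugins.git
```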
Download paths:
- [CADD](https://cadd.bihealth.org/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices
- [SpliceAI](https://basespace.illumina.com/s/otSPW8hnhaZR): "genome_scores_v1.3"/"spliceai_scores.raw.snv.hg38.vcf.gz" and "spliceai_scores.raw.indel.hg38.vcf.gz"
@@ -95,15 +90,14 @@ Data for VEP plugins and the CADD cache are stored in `annotation data`.
mamba activate deeprvat_annotations
pip install -e path/to/deeprvat
```
- Clone the repositories mentioned in [requirements](#requirements) into `repo_dir` and install the needed conda environments with
- Clone the repositories mentioned in [requirements](#requirements) into `repo_dir` and install the required kipoi-veff2 conda environment with
```shell
mamba env create -f repo_dir/kipoi-veff2/environment.minimal.linux.yml
mamba env create -f deeprvat/deeprvat_annotations.yml
```
If you already have some of the needed repositories on your machine, you can edit the paths in the [config](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_annotation_config.yaml).


- Inside the annotation directory create a directory `annotation_dir` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements))
- Inside the annotation directory create a directory `annotation_data` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements))
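
A possible way to stage these files, shown as a sketch with placeholder source paths (the SpliceAI and gnomAD indel file names follow the downloads listed under [requirements](#requirements)):

```shell
# Sketch with placeholder paths: link pre-downloaded score files into annotation_data.
mkdir -p annotation_data
ln -s /path/to/downloads/spliceai_scores.raw.snv.hg38.vcf.gz annotation_data/
ln -s /path/to/downloads/spliceai_scores.raw.indel.hg38.vcf.gz annotation_data/
ln -s /path/to/downloads/gnomad.genomes.r3.0.indel.tsv.gz annotation_data/
```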


### Running the pipeline
2 changes: 1 addition & 1 deletion docs/preprocessing.md
@@ -198,7 +198,7 @@ gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
4. Run with the example config

```shell
snakemake -j 1 --snakefile ../../pipelines/preprocess_no_qc.snakefile --configfile ../../pipelines/config/deeprvat_preprocess_config.yaml
snakemake -j 1 --snakefile ../../pipelines/preprocess_no_qc.snakefile --configfile ../../example/config/deeprvat_preprocess_config.yaml
```

5. Enjoy the preprocessed data 🎉
4 changes: 3 additions & 1 deletion docs/pretrained_models.md
@@ -7,8 +7,10 @@ For using the pretrained DeepRVAT model provided as part of the package, or a cu

Configuration parameters must be specified in `deeprvat_input_pretrained_models_config.yaml`; see the [example file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_input_pretrained_models_config.yaml). For details on the meanings of the parameters and the format of input files, see [here](input_data).


To use pretrained models, you must specify `use_pretrained_models: True` in your `deeprvat_input_pretrained_models_config.yaml` configuration file. Additionally, provide the path to the pretrained models (an output of the training pipeline) in the parameter `pretrained_model_path`. The `pretrained_model_path` directory must contain a `config.yaml` file with the keys that were used for training the pretrained models: `rare_variant_annotations`, `training_data_thresholds`, and `model`. See the [example file](https://github.com/PMBio/deeprvat/blob/main/example/config/deeprvat_input_pretrained_models_config.yaml).
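
As an illustrative sanity check of this layout (the directory name `pretrained_models` is a placeholder for your `pretrained_model_path`):

```shell
# Illustrative check only: the pretrained model directory must contain a
# config.yaml that defines the keys named above.
ls pretrained_models/config.yaml
grep -E '(rare_variant_annotations|training_data_thresholds|model):' pretrained_models/config.yaml
```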


The configuration parameters specified in `deeprvat_input_pretrained_models_config.yaml` are outlined below.

The following parameters specify the locations of required input files:
@@ -33,7 +35,7 @@ association_testing_data_thresholds (optional)
cv_options (optional)
```

Note that the file specified by `annotation_filename` must contain a column corresponding to each annotation in the list `rare_variant_annotations` from `deeprvat/pretrained_models/config.yaml`.
Note that the file specified by `annotation_filename` must contain a column corresponding to each annotation in the list `rare_variant_annotations` from `deeprvat/pretrained_models/model_config.yaml`.


## Executing the pipeline
