From a1f731f3f0a0162b50550a498987d1f6cf945338 Mon Sep 17 00:00:00 2001 From: Brian Clarke Date: Thu, 29 Feb 2024 14:22:06 +0100 Subject: [PATCH 01/15] move installation into own section --- docs/index.rst | 3 ++- docs/installation.md | 21 +++++++++++++++++++++ docs/{usage.md => quickstart.md} | 28 ++-------------------------- 3 files changed, 25 insertions(+), 27 deletions(-) create mode 100644 docs/installation.md rename docs/{usage.md => quickstart.md} (65%) diff --git a/docs/index.rst b/docs/index.rst index fcabaffc..7629ff4f 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -13,7 +13,8 @@ Rare variant association testing using deep learning and data-driven burden scor :maxdepth: 2 :caption: Contents: - usage.md + installation.md + quickstart.md preprocessing.md annotations.md seed_gene_discovery.md diff --git a/docs/installation.md b/docs/installation.md new file mode 100644 index 00000000..a3e61d8c --- /dev/null +++ b/docs/installation.md @@ -0,0 +1,21 @@ +# Installation + +1. Clone this repository: +```shell +git clone git@github.com:PMBio/deeprvat.git +``` +1. Change directory to the repository: `cd deeprvat` +1. Install the conda environment. We recommend using [mamba](https://mamba.readthedocs.io/en/latest/index.html), though you may also replace `mamba` with `conda` + + *Note: [the current deeprvat env does not support cuda when installed with conda](https://github.com/PMBio/deeprvat/issues/16), install using mamba for cuda support.* +```shell +mamba env create -n deeprvat -f deeprvat_env.yaml +``` +1. Activate the environment: `mamba activate deeprvat` +1. Install the `deeprvat` package: `pip install -e .` + +If you don't want to install the GPU-related requirements, use the `deeprvat_env_no_gpu.yml` environment instead. +```shell +mamba env create -n deeprvat -f deeprvat_env_no_gpu.yaml +``` + diff --git a/docs/usage.md b/docs/quickstart.md similarity index 65% rename from docs/usage.md rename to docs/quickstart.md index 5d7c9170..19bb9ac4 100644 --- a/docs/usage.md +++ b/docs/quickstart.md @@ -1,27 +1,3 @@ -# Using DeepRVAT - -## Installation - -1. Clone this repository: -```shell -git clone git@github.com:PMBio/deeprvat.git -``` -1. Change directory to the repository: `cd deeprvat` -1. Install the conda environment. We recommend using [mamba](https://mamba.readthedocs.io/en/latest/index.html), though you may also replace `mamba` with `conda` - - *note: [the current deeprvat env does not support cuda when installed with conda](https://github.com/PMBio/deeprvat/issues/16), install using mamba for cuda support.* -```shell -mamba env create -n deeprvat -f deeprvat_env.yaml -``` -1. Activate the environment: `mamba activate deeprvat` -1. Install the `deeprvat` package: `pip install -e .` - -If you don't want to install the gpu related requirements use the `deeprvat_env_no_gpu.yml` environment instead. -```shell -mamba env create -n deeprvat -f deeprvat_env_no_gpu.yaml -``` - - ## Basic usage ### Customize pipelines @@ -31,7 +7,7 @@ Before running any of the snakefiles, you may want to adjust the number of threa If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles) and may need to add `resources:` directives to the snakefiles. 
-### Run the preprocessing pipeline on VCF files +### Run the preprocessing pipeline on your VCF files Instructions [here](preprocessing.md) @@ -42,7 +18,7 @@ Instructions [here](annotations.md) -### Try the full training and association testing pipeline on some example data +### Run the full training and association testing pipeline on some example data ```shell mkdir example From be1e80a89c8ca90a9d3c9d01ea78fd37a3ca80e5 Mon Sep 17 00:00:00 2001 From: Brian Clarke Date: Thu, 11 Apr 2024 16:50:53 +0200 Subject: [PATCH 02/15] restructure docs --- docs/deeprvat.md | 38 +++++++++++++++++++++++++++++++ docs/general.md | 18 +++++++++++++++ docs/index.rst | 4 ++++ docs/quickstart.md | 45 ++++++++++++++++++++----------------- docs/seed_gene_discovery.md | 14 ++++++------ 5 files changed, 92 insertions(+), 27 deletions(-) create mode 100644 docs/deeprvat.md create mode 100644 docs/general.md diff --git a/docs/deeprvat.md b/docs/deeprvat.md new file mode 100644 index 00000000..145c16c7 --- /dev/null +++ b/docs/deeprvat.md @@ -0,0 +1,38 @@ +# Training and association testing with DeepRVAT + +*TODO:* Overview of procedure, multiple flavors (with training, with pretrained models, with precomputed burdens, with/without REGENIE) + + +## Configuration file: Common parameters + +*TODO:* Describe common parameters, give example + + +## Training + +### Configuration file + +### Running the training pipeline + + +## Association testing + +### Configuration file + +### Running the association testing pipeline with REGENIE + +### Running the association testing pipeline with SEAK + + +## Training and association testing with a combined pipeline + +### Configuration file + +### Running the training and association testing pipeline with REGENIE + +### Running the training and association testing pipeline with SEAK + + +## Running only a portion of any pipeline + +*TODO:* Point to modular breakdowns of pipelines diff --git a/docs/general.md b/docs/general.md new file mode 100644 index 00000000..bcc6e71d --- /dev/null +++ b/docs/general.md @@ -0,0 +1,18 @@ +# General considerations + +## Pipeline resource requirements + +*TODO:* Note that pipelines have some suggested resource requirements, may need to be adjusted for cluster execution + + +## Cluster execution + +*TODO:* Point to snakemake profiles + + +## Execution on GPU vs. CPU + +*TODO:* Two rules that use GPU. Training pretty much requires GPU, burden computation is okay on CPU but substantially slower + + +## *TODO:* Add any other points? diff --git a/docs/index.rst b/docs/index.rst index 7629ff4f..362b0b26 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -15,9 +15,13 @@ Rare variant association testing using deep learning and data-driven burden scor installation.md quickstart.md + general.md preprocessing.md annotations.md seed_gene_discovery.md + deeprvat.md + config.md + ukbiobank.md apidocs/index diff --git a/docs/quickstart.md b/docs/quickstart.md index 19bb9ac4..dec65a71 100644 --- a/docs/quickstart.md +++ b/docs/quickstart.md @@ -1,61 +1,66 @@ -## Basic usage +# Basic usage -### Customize pipelines +## Customize pipelines Before running any of the snakefiles, you may want to adjust the number of threads used by different steps in the pipeline. To do this, modify the `threads:` property of a given rule. If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles) and may need to add `resources:` directives to the snakefiles. 
-### Run the preprocessing pipeline on your VCF files +## Run the preprocessing pipeline on your VCF files Instructions [here](preprocessing.md) -### Annotate variants +## Annotate variants Instructions [here](annotations.md) +## Example DeepRVAT runs + +In each case, replace `[path_to_deeprvat]` with the path to your clone of the repository. + +Note that the example data used here is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed. + ### Run the full training and association testing pipeline on some example data ```shell -mkdir example -cd example +mkdir deeprvat_train_associate +cd deeprvat_train_associate ln -s [path_to_deeprvat]/example/* . snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_testing.snakefile ``` -Replace `[path_to_deeprvat]` with the path to your clone of the repository. - -Note that the example data is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed. - ### Run the training pipeline on some example data ```shell -mkdir example -cd example +mkdir deeprvat_train +cd deeprvat_train ln -s [path_to_deeprvat]/example/* . snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile ``` -Replace `[path_to_deeprvat]` with the path to your clone of the repository. - -Note that the example data is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed. - ### Run the association testing pipeline with pretrained models ```shell -mkdir example -cd example +mkdir deeprvat_associate +cd deeprvat_associate ln -s [path_to_deeprvat]/example/* . ln -s [path_to_deeprvat]/pretrained_models snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained.snakefile ``` -Replace `[path_to_deeprvat]` with the path to your clone of the repository. -Again, note that the example data is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed. +### Run association testing using REGENIE on precomputed burdens + +```shell +mkdir deeprvat_associate_regenie +cd deeprvat_associate_regenie +ln -s [path_to_deeprvat]/example/* . +ln -s precomputed_burdens/burdens.zarr . +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained_regenie.snakefile +``` diff --git a/docs/seed_gene_discovery.md b/docs/seed_gene_discovery.md index c0fefda9..84519c6f 100644 --- a/docs/seed_gene_discovery.md +++ b/docs/seed_gene_discovery.md @@ -21,17 +21,17 @@ The `annotations.parquet` data frame should have the following columns: - Consequence_missense_variant: - MAF: Maximum of the MAF in the UK Biobank cohort and in gnomAD release 3.0 (non-Finnish European population) can also be changed by using the --maf-column {maf_col_name} flag for the rule config and replacing MAF in the config.yaml with the {maf_col_name} but it must contain the string '_AF', '_MAF' OR '^MAF' -### Run the seed gene discovery pipeline with example data +## Configuration file -Create the conda environment and activate it, (instructions can be found here [DeepRVAT instructions](usage.md) ) +*TODO:* Describe `config.yaml`, give example +## Running the seed gene discovery pipeline + +In a directory with all of the [input data](#input-data) required and your [configuration file](#configuration-file) set up, run: + ``` -mkdir example -cd example -ln -s [path_to_deeprvat]/example/* . 
-cp [path_to_deeprvat]/deeprvat/seed_gene_discovery/config.yaml . -snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile +[path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile ``` Replace `[path_to_deeprvat]` with the path to your clone of the repository. From 472dd450d0d4132402eeaa80c57e07b316680e5e Mon Sep 17 00:00:00 2001 From: Brian Clarke Date: Thu, 11 Apr 2024 16:53:30 +0200 Subject: [PATCH 03/15] remove separate config section --- docs/index.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/index.rst b/docs/index.rst index 362b0b26..ea9f40da 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -20,7 +20,6 @@ Rare variant association testing using deep learning and data-driven burden scor annotations.md seed_gene_discovery.md deeprvat.md - config.md ukbiobank.md apidocs/index From d6cc98291b0605abb14a18373609c700d99c2c31 Mon Sep 17 00:00:00 2001 From: Brian Clarke Date: Thu, 11 Apr 2024 16:57:40 +0200 Subject: [PATCH 04/15] corrections --- docs/deeprvat.md | 17 ++++++++++++++--- docs/seed_gene_discovery.md | 2 +- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/docs/deeprvat.md b/docs/deeprvat.md index 145c16c7..55237985 100644 --- a/docs/deeprvat.md +++ b/docs/deeprvat.md @@ -3,6 +3,9 @@ *TODO:* Overview of procedure, multiple flavors (with training, with pretrained models, with precomputed burdens, with/without REGENIE) +## Input data: Common requirements for all pipelines + + ## Configuration file: Common parameters *TODO:* Describe common parameters, give example @@ -10,6 +13,8 @@ ## Training +### Input data + ### Configuration file ### Running the training pipeline @@ -17,21 +22,27 @@ ## Association testing +### Input data + ### Configuration file ### Running the association testing pipeline with REGENIE ### Running the association testing pipeline with SEAK +### Association testing using precomputed burdens + +*TODO:* With and without REGENIE ## Training and association testing with a combined pipeline -### Configuration file +### Input data and configuration file -### Running the training and association testing pipeline with REGENIE +*TODO:* Everything required for each separate one descibed above -### Running the training and association testing pipeline with SEAK +### Running the training and association testing pipeline +*TODO:* With and without REGENIE ## Running only a portion of any pipeline diff --git a/docs/seed_gene_discovery.md b/docs/seed_gene_discovery.md index 84519c6f..038f4e4b 100644 --- a/docs/seed_gene_discovery.md +++ b/docs/seed_gene_discovery.md @@ -6,7 +6,7 @@ To run the pipeline, an experiment directory with the `config.yaml` has to be cr ## Input data -The experiment directory in addition requires to have the same input data as specified for [DeepRVAT](usage.md), including +The experiment directory in addition requires to have the same input data as specified for [DeepRVAT](deeprvat.md), including - `annotations.parquet` - `protein_coding_genes.parquet` - `genotypes.h5` From e42fef1fae2d77def830f64b11787d0e44daf685 Mon Sep 17 00:00:00 2001 From: Brian Clarke Date: Wed, 17 Apr 2024 11:59:40 +0200 Subject: [PATCH 05/15] add skeleton for UK Biobank analysis --- ukbiobank.md | 9 +++++++++ 1 file changed, 9 insertions(+) create mode 100644 ukbiobank.md diff --git a/ukbiobank.md b/ukbiobank.md new file mode 100644 index 00000000..07333e8d --- /dev/null +++ b/ukbiobank.md @@ -0,0 +1,9 @@ +# UK Biobank analysis + +This section will be filled with content soon! 
+ +## First steps + +## Basic analysis: Using precomputed burdens + +## Advanced analysis: Custom-trained DeepRVAT model From 9c5b541c00c071e49353df57d6b51d1923f95680 Mon Sep 17 00:00:00 2001 From: Kayla Meyer Date: Fri, 26 Apr 2024 15:43:42 +0200 Subject: [PATCH 06/15] populating deeprvat read-the-docs --- docs/deeprvat.md | 115 ++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 105 insertions(+), 10 deletions(-) diff --git a/docs/deeprvat.md b/docs/deeprvat.md index 55237985..5d14b9e8 100644 --- a/docs/deeprvat.md +++ b/docs/deeprvat.md @@ -1,49 +1,144 @@ # Training and association testing with DeepRVAT +We have developed multiple flavors of running DeepRVAT to suite your needs. Below lists various running setups that entail just training DeepRVAT, using pretrained DeepRVAT models for association testing, using precomputed burdens for association testing, including REGENIE in training and association testing and also combinations of these scenarios. The general procedure is to have the relevant input data for a given setup appropriately prepared, which may include having already completed the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html) and [annotation pipeline](https://deeprvat.readthedocs.io/en/latest/annotations.html). -*TODO:* Overview of procedure, multiple flavors (with training, with pretrained models, with precomputed burdens, with/without REGENIE) +*TODO* also add CV training ? +## Installation +First the deepRVAT repository must be cloned in your `experiment` directory and the corresponding environment activated. Instructions are [here](installation.md) to setup the deepRVAT repository. ## Input data: Common requirements for all pipelines +An example overview of what your `experiment`` directory should contain can be seen here: +`[path_to_deeprvat]/example/` +Replace `[path_to_deeprvat]` with the path to your clone of the repository. +Note that the example data contained within the example directory is randomly generated, and is only suited for testing. -## Configuration file: Common parameters +- `genotypes.h5` +contains the *TODO* Which is an output from the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html). + +- `variants.parquet` +contains the list of variants from the input vcf files, which is an output from the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html). + +- `annotations.parquet` +contains all the variant annotations, which is an output from the annotation pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html). + +- `protein_coding_genes.parquet` +contains the IDs to all the protein coding genes, which is an output from the annotation pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html). + +- `config.yaml` +contains the configuration parameters for setting phenotypes, training data, model, training, and association data variables. +- `phenotypes.parquet` +contains the *TODO* + +- `[path_to_deeprvat]/example/baseline_results` +directory containing the results of the seed gene discovery pipline. Insturctions [here](seed_gene_discovery.md) + +## Configuration file: Common parameters *TODO:* Describe common parameters, give example +The `config.yaml` file located in your `experiment` directory contains the configuration parameters of key sections: phenotypes, baseline_results, training_data, and data. 
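If it helps to see those sections programmatically, below is a minimal sketch for inspecting them with PyYAML. The section names are taken from the description above; the assumption is that the script is run inside the experiment directory where `config.yaml` lives.

```python
# Minimal sketch: inspect the key sections of config.yaml.
# Assumes PyYAML is available (it is part of the deeprvat environment) and that
# the script is run from inside the experiment directory.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# The sections described above; missing keys are reported rather than raising.
for section in ["phenotypes", "baseline_results", "training_data", "data"]:
    status = "present" if section in config else "MISSING from config.yaml"
    print(f"{section}: {status}")

# Phenotypes used for training vs. association testing (explained below).
print("training phenotypes:", config.get("training", {}).get("phenotypes", "not set"))
print("all phenotypes:", config.get("phenotypes", "not set"))
```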
+ +`config['phenotypes]` should consist of a complete list of phenotypes. To adjust only those phenotypes that should be used in training, add the phenotype names as a list under `config['training']['phenotypes']`. The phenotypes that are not listed under `config['training']['phenotypes']`, but are listed under +`config['phenotypes]` will subsequently be used only for association testing. + +*TODO* baseline results + +`config['training_data']` contains the relevant specifications for the training dataset creation. + +`config['data']` contains the relevant specifications for the association dataset creation. ## Training +To run only the training stage of DeepRVAT, comprised of training data creation and running the deepRVAT model, we have setup a training pipeline. ### Input data +The following files should be contained within your `experiment` directory: +- `config.yaml` +- `genotypes.h5` +- `variants.parquet` +- `annotations.parquet` +- `phenotypes.parquet` +- `protein_coding_genes.parquet` +- `baseline_results` directory ### Configuration file ### Running the training pipeline - +```shell +cd experiment +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile +``` ## Association testing +If you already have a pretrained DeepRVAT model, we have setup pipelines for runing only the association testing stage. This includes creating the association dataset files, computing burdens, regression, and evaluation. ### Input data +The following files should be contained within your `experiment` directory: +- `config.yaml` +- `genotypes.h5` +- `variants.parquet` +- `annotations.parquet` +- `phenotypes.parquet` +- `protein_coding_genes.parquet` +- `baseline_results` directory ### Configuration file ### Running the association testing pipeline with REGENIE +*TODO* +For running with REGENIE, in addition the input data, the following REGENIE specific files should also be included in your `experiment` directory: +- `.sample` file containing the sample ID, genetic sex +- `.sniplist` file containing *TODO* +- `.bgen` +- `.bgen.bgi` + +For the REGENIE specific files, please refer to the [REGENIE documentation](https://rgcgithub.github.io/regenie/). + +```shell +cd experiment +ln -s [path_to_deeprvat]/pretrained_models +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained.snakefile +``` ### Running the association testing pipeline with SEAK +```shell +cd experiment +ln -s [path_to_deeprvat]/pretrained_models +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained.snakefile +``` ### Association testing using precomputed burdens - *TODO:* With and without REGENIE ## Training and association testing with a combined pipeline +To run the full pipeline from training through association testing, use the below procedure. This includes training and association testing dataset generation, deepRVAT model training, computation of burdens, regression and evaluation. 
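Before launching the combined pipeline, it can be worth checking that the expected inputs (listed in the next subsection) are actually present, since a missing file only surfaces once the corresponding rule runs. A minimal sketch, assuming the file names used throughout this page and that it is run from inside the experiment directory:

```python
# Minimal sanity check before launching the combined pipeline.
# File names follow the input data lists on this page; adjust if your layout differs.
from pathlib import Path

experiment_dir = Path(".")  # run from inside the experiment directory
required = [
    "config.yaml",
    "genotypes.h5",
    "variants.parquet",
    "annotations.parquet",
    "phenotypes.parquet",
    "protein_coding_genes.parquet",
    "baseline_results",  # directory with seed gene discovery results
]

missing = [name for name in required if not (experiment_dir / name).exists()]
if missing:
    print("Missing inputs:", ", ".join(missing))
else:
    print("All expected inputs found.")
```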
### Input data and configuration file - -*TODO:* Everything required for each separate one descibed above +The following files should be contained within your `experiment` directory: +- `config.yaml` +- `genotypes.h5` +- `variants.parquet` +- `annotations.parquet` +- `phenotypes.parquet` +- `protein_coding_genes.parquet` +- `baseline_results` directory ### Running the training and association testing pipeline - -*TODO:* With and without REGENIE +The process is as follows: +```shell +cd experiment +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_testing.snakefile +``` + +#### Running with REGENIE +*TODO:* +```shell +cd experiment +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_testing_regenie.snakefile +``` ## Running only a portion of any pipeline - -*TODO:* Point to modular breakdowns of pipelines +The snakemake pipelines outlined above are compromised of integrated common workflows. These smaller snakefiles which breakdown specific pipelines sections are in the following directories: +- `[path_to_deeprvat]/pipeline/association_testing` contains snakefiles breakingdown stages of the association testing. +- `[path_to_deeprvat]/pipeline/cv_training` contains snakefiles used to run training in a cross-validation setup. +- `[path_to_deeprvat]/pipeline/training` contains snakefiles used in setting up deepRVAT training. From 8834ebfa1c9498fa968deb59fb2133e35445f34b Mon Sep 17 00:00:00 2001 From: Eva Holtkamp Date: Tue, 14 May 2024 16:03:35 +0200 Subject: [PATCH 07/15] update deeprvat docs --- docs/deeprvat.md | 167 ++++++++++++++++++++++++++++++------ docs/seed_gene_discovery.md | 9 +- 2 files changed, 146 insertions(+), 30 deletions(-) diff --git a/docs/deeprvat.md b/docs/deeprvat.md index 5d14b9e8..b80c2b59 100644 --- a/docs/deeprvat.md +++ b/docs/deeprvat.md @@ -1,55 +1,96 @@ # Training and association testing with DeepRVAT We have developed multiple flavors of running DeepRVAT to suite your needs. Below lists various running setups that entail just training DeepRVAT, using pretrained DeepRVAT models for association testing, using precomputed burdens for association testing, including REGENIE in training and association testing and also combinations of these scenarios. The general procedure is to have the relevant input data for a given setup appropriately prepared, which may include having already completed the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html) and [annotation pipeline](https://deeprvat.readthedocs.io/en/latest/annotations.html). -*TODO* also add CV training ? ## Installation First the deepRVAT repository must be cloned in your `experiment` directory and the corresponding environment activated. Instructions are [here](installation.md) to setup the deepRVAT repository. ## Input data: Common requirements for all pipelines -An example overview of what your `experiment`` directory should contain can be seen here: +An example overview of what your `experiment` directory should contain can be seen here: `[path_to_deeprvat]/example/` Replace `[path_to_deeprvat]` with the path to your clone of the repository. Note that the example data contained within the example directory is randomly generated, and is only suited for testing. - `genotypes.h5` -contains the *TODO* Which is an output from the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html). +contains the genotypes for all samples in a custom sparse format. 
The sample ids in the `sample` slot are the same as in the VCF files the `genotypes.h5` has been read from. +This is output by the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html). - `variants.parquet` -contains the list of variants from the input vcf files, which is an output from the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html). +contains variant characteristics (`chrom`, `pos`, `ref`, `alt`) and the assigned variant `id` for all unique variants in `genotypes.h5`. This +is output from the input vcf files using preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html). - `annotations.parquet` -contains all the variant annotations, which is an output from the annotation pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html). +contains the variant annotations for all variants in `variants.parquet`, which is an output from the annotation pipeline. Each variant is identified by its `id`. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html). - `protein_coding_genes.parquet` -contains the IDs to all the protein coding genes, which is an output from the annotation pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html). +Maps the `gene_id` used in `annotations.parquet` to actual genes (EnsemblID and HGNC gene name). This is an output from the annotation pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html). - `config.yaml` contains the configuration parameters for setting phenotypes, training data, model, training, and association data variables. -- `phenotypes.parquet` -contains the *TODO* +- `phenotypes.parquet` contains the measured phenotypes for all samples (see `[path_to_deeprvat]/example/`). The row index must be the sample id as strings (same ids as used in the VCF file) and the column names the phenotype name. Phenotypes can be quantitative or binary (0,1). Use `NA` for missing values. +Samples missing in `phenotypes.parquet` won't be used in DeepRVAT training/testing. The user must generate this file as it's not output by the preprocessing/annotation pipeline. +This file must also contain all covariates that should be used during training/association testing (e.g., genetic sex, age, genetic principal components). - `[path_to_deeprvat]/example/baseline_results` directory containing the results of the seed gene discovery pipline. Insturctions [here](seed_gene_discovery.md) ## Configuration file: Common parameters -*TODO:* Describe common parameters, give example -The `config.yaml` file located in your `experiment` directory contains the configuration parameters of key sections: phenotypes, baseline_results, training_data, and data. +The `config.yaml` file located in your `experiment` directory contains the configuration parameters of key sections: phenotypes, baseline_results, training_data, and data. It also allows to set many other configurations, such as the variant annotations + +`config['training_data']` contains the relevant specifications for the training dataset creation. + +`config['data']` contains the relevant specifications for the association dataset creation. + +### Baseline results +Specifies paths to results from various seed gene discovery method runs (Burden/SKAT test with pLoF and missense variants). 
When using the seed gene discovery pipeline provided with this package, simply link the directory as 'baseline_results' in the experiment directory without any further changes. + +If you want to provide custom baseline results (already combined across tests), store them like `baseline_results/{phenotype}/combined_baseline/eval/burden_associations.parquet` and set the `baseline_results` in the config to +``` +- base: baseline_results + type: combined_baseline +``` +Baseline files have to be provided for each `{phenotype}` in `config['training']['phenotypes']`. The `burden_associations.parquet` must have the columns `gene` (gene id as assigned in `protein_coding_genes.parquet`) and `pval` (see `[path_to_deeprvat]/example/baseline_results`). +*TODO* add that seed gene config can be set via the `config['phenotypes]` + + +### Phenotypes `config['phenotypes]` should consist of a complete list of phenotypes. To adjust only those phenotypes that should be used in training, add the phenotype names as a list under `config['training']['phenotypes']`. The phenotypes that are not listed under `config['training']['phenotypes']`, but are listed under `config['phenotypes]` will subsequently be used only for association testing. +All phenotypes listed either in `config['phenotypes]` or `config['training']['phenotypes']` have to be in the column names of `phenotypes.parquet`. -*TODO* baseline results -`config['training_data']` contains the relevant specifications for the training dataset creation. +### Customizing the input data via the config file -`config['data']` contains the relevant specifications for the association dataset creation. +#### Data transformation + +The pipeline supports z-score standardization (`standardize`) and quantile transformation (`quantile_transform`) as transformation to of the target phenotypes. It has to be set in `config[key]['dataset_config']['y_transformation]`, where `key` is `training_data` or `data` to transform the training data and association testing data, respectively. + +For the annotations and the covariates, we allow standardization via `config[key]['dataset_config']['standardize_xpheno'] = True` (default = True) and `config[key]['dataset_config']['standardize_anno'] = True` (default = False). + +If custom transformations are whished, we recommend to replace the respective columns in `phenotypes.parquet` or `annotations.parquet` with the transformed values. + +#### Variant annotations +All variant anntations that should be included in DeepRVAT's variant annotation vectors have to be listed in `config[key]['dataset_config']['annotations']` and `config[key]['dataset_config']['rare_embedding']['config']['annotations']` (this will be simplified in future). Any annotation that is used for variant filtering `config[key]['dataset_config']['rare_embedding']['config']['thresholds']` also has to be included in `config[key]['dataset_config']['annotations']`. + +#### Variant minor allele frequency filter + +To set a threshold for variants with a MAF below a certain value (e.g., UKB_MAF < 0.1%), use: +`config[key]['dataset_config']['rare_embedding']['config']['thresholds']['UKB_MAF'] = "UKB_MAF < 1e-3"`. In this example, `UKB_MAF` represents the MAF column from annotations.parquet here denoting MAF in the UK Biobank. 
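The threshold entries can be edited by hand, but they can also be set programmatically. The sketch below is one way to do this with PyYAML; it follows the nested keys shown in this section, and `UKB_MAF` with the `1e-3` cutoff are simply the example values from the text. Note that it also appends the column to the `annotations` list, following the rule above that any annotation used for filtering must be listed there.

```python
# Sketch: set a MAF filter for the association testing dataset ('data') in config.yaml.
# Keys follow config['data']['dataset_config']['rare_embedding']['config']['thresholds'];
# UKB_MAF is the example MAF column from annotations.parquet used in the text.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

dataset_config = config["data"]["dataset_config"]

# Any annotation used in `thresholds` must also be listed under 'annotations'.
annotations = dataset_config.setdefault("annotations", [])
if "UKB_MAF" not in annotations:
    annotations.append("UKB_MAF")

thresholds = dataset_config["rare_embedding"]["config"].setdefault("thresholds", {})
thresholds["UKB_MAF"] = "UKB_MAF < 1e-3"  # keep only variants with MAF < 0.1%

# Note: PyYAML round-tripping drops comments and reformats the file;
# editing config.yaml by hand may be preferable for heavily commented configs.
with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```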
+ +#### Additional variant filters +Additional variant filters can be added via `config[key]['dataset_config']['rare_embedding']['config']['thresholds'][{anno}] = "{anno} > X"`.For example `config['data]['dataset_config']['rare_embedding']['config']['thresholds']['CADD_PHRED'] = "CADD_PHRED > 5"` will only include variants with a CADD score > 5 during association testing. Mind that all annotations used in the `threshold` section also have to be listed in `config[key]['dataset_config']['annotations']`. + +#### subsetting samples +To specify a sample file for training or association testing, use: `config[key]['dataset_config']['sample_file]`. +Only `.pkl` files containing a list of sample IDs (string) are supported at the moment. +For example, if DeepRVAT training and association testing should be done on two separat datas sets, you can provide two sample files `training_samples.pkl` and `test_samples.pkl` via `config['training_data']['dataset_config']['sample_file] = training_samples.pkl` and `config['data']['dataset_config']['sample_file] = test_samples.pkl`. ## Training -To run only the training stage of DeepRVAT, comprised of training data creation and running the deepRVAT model, we have setup a training pipeline. +To run only the training stage of DeepRVAT, comprised of training data creation and running the DeepRVAT model, we have setup a training pipeline. ### Input data The following files should be contained within your `experiment` directory: @@ -59,9 +100,12 @@ The following files should be contained within your `experiment` directory: - `annotations.parquet` - `phenotypes.parquet` - `protein_coding_genes.parquet` -- `baseline_results` directory +- `baseline_results` directory where `[path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile` has been run ### Configuration file +Changes to the model architecture and training parameters can be made via `config['training']`, `config['pl_trainer']`, `config['early_stopping']`, `config['model']`. +Per default, DeepRVAT scores are ensembled from 6 models. This can be changed via `config['n_repeats']`. + ### Running the training pipeline ```shell @@ -72,6 +116,7 @@ snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile ## Association testing If you already have a pretrained DeepRVAT model, we have setup pipelines for runing only the association testing stage. This includes creating the association dataset files, computing burdens, regression, and evaluation. + ### Input data The following files should be contained within your `experiment` directory: - `config.yaml` @@ -80,27 +125,72 @@ The following files should be contained within your `experiment` directory: - `annotations.parquet` - `phenotypes.parquet` - `protein_coding_genes.parquet` -- `baseline_results` directory ### Configuration file +The annotations in `config['data']['dataset_config']['rare_embedding']['config']['annotations']` must be the same (and in the same order) as in `config['data']['dataset_config']['rare_embedding']['config']['annotations']` from the pre-trained model. +If you use the pre-trained DeepRVAT model provided with this package, use `config['data']['dataset_config']['rare_embedding']['config']['annotations']` from the `[path_to_deeprvat]/example/config.yaml` to ensure the ordering of annotations is correct. 
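Because both the list and its order must match, a quick comparison against the example config can catch mistakes before burdens are computed. A minimal sketch, assuming your experiment config is `config.yaml` and the repository's example config is reachable at the path shown (replace `[path_to_deeprvat]` as usual):

```python
# Sketch: check that the annotation list (and its order) matches the example config
# shipped with the pre-trained models. Replace [path_to_deeprvat] with your clone path.
import yaml

def annotation_list(path):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    return cfg["data"]["dataset_config"]["rare_embedding"]["config"]["annotations"]

mine = annotation_list("config.yaml")
reference = annotation_list("[path_to_deeprvat]/example/config.yaml")

if mine == reference:
    print(f"Annotations match ({len(mine)} entries, same order).")
else:
    print("Annotation lists differ!")
    for i, (a, b) in enumerate(zip(mine, reference)):
        if a != b:
            print(f"  position {i}: {a!r} vs {b!r}")
    if len(mine) != len(reference):
        print(f"  lengths differ: {len(mine)} vs {len(reference)}")
```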
### Running the association testing pipeline with REGENIE -*TODO* + +#### Input data For running with REGENIE, in addition the input data, the following REGENIE specific files should also be included in your `experiment` directory: -- `.sample` file containing the sample ID, genetic sex -- `.sniplist` file containing *TODO* -- `.bgen` -- `.bgen.bgi` -For the REGENIE specific files, please refer to the [REGENIE documentation](https://rgcgithub.github.io/regenie/). -```shell +To run REGENIE Step 1 +- `.sample` Inclusion file that lists individuals to retain in the analysis +- `.sniplist` Inclusion file that lists IDs of variants to keep +- `.bgen` input genetic data file +- `.bgen.bgi` index bgi file corresponding to input BGEN file + +For these REGENIE specific files, please refer to the [REGENIE documentation](https://rgcgithub.github.io/regenie/). + +For running REGENIE Step 2: +- `gtf file` gencode gene annotation gtf file +- `keep_samples.txt` (optional file of samples to include) +- `protein_coding_genes.parquet` + +#### Config file + +Use the `[path_to_deeprvat]/example/config_regenie.yaml` as `config.yaml` which includes REGENIE specific parameters. +You can set any parameter explained in the [REGENIE documentation](https://rgcgithub.github.io/regenie/) via this config. +Most importantly, for association testing of binary traits use: +``` +step_2: + options: + - "--bt" + - "--firth --approx --pThresh 0.01" + +``` +and for quantitative traits: +``` +step_2: + options: + - "--qt" +``` + +#### Run REGENIE + + +``` cd experiment ln -s [path_to_deeprvat]/pretrained_models snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained.snakefile ``` + +#### Testing multiple sub-chohorts +For testing multiple sub-cohorts, remember that REGENIE Step 1 (compute intense) only needs to be executed once per sample and phenotype. We suggest running REGENIE Step 1 on all samples and phenotypes initially and then linking the output as regenie_output/step1/ in each experiment directory for testing a sub-cohort. + +Samples to be considered when testing sub-cohorts can be provided via `keep_samples.txt` which look like + +``` +12345 12345 +56789 56789 +```` +for keeping two samples with ids `12345` and `56789` + ### Running the association testing pipeline with SEAK + ```shell cd experiment ln -s [path_to_deeprvat]/pretrained_models @@ -110,6 +200,36 @@ snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pret ### Association testing using precomputed burdens *TODO:* With and without REGENIE +## Training and association testing using cross-validation + +DeepRVAT offers a cv-like scheme, where it's trained on all samples except those in the held-out fold. Then, it computes gene impairment scores for the held-out samples using models that excluded them. This process repeats for all folds, providing DeepRVAT scores for all samples from models that hadn't seen them before. This is repeated for all folds, yielding DeepRVAT scores for all samples. + +### Input data and configuration file +The following files should be contained within your `experiment` directory: +- `config.yaml` +- `genotypes.h5` +- `variants.parquet` +- `annotations.parquet` +- `phenotypes.parquet` +- `protein_coding_genes.parquet` +- `baseline_results` directory +- `sample_files` provides training and test samples for each cross-validation fold as pickle files. 
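One way the per-fold sample files mentioned in the last item might be generated is sketched below. The 5-fold layout and the `samples_{split}{fold}.pkl` naming follow the description in the next subsection; using `phenotypes.parquet` as the source of sample IDs and the fixed random seed are assumptions for the sketch.

```python
# Sketch: write train/test sample lists for 5-fold cross-validation as pickle files.
# Naming follows sample_files/5_fold/samples_{split}{fold}.pkl (see next subsection).
# Sample IDs are taken from phenotypes.parquet (assumption) and stored as strings.
import pickle
from pathlib import Path

import numpy as np
import pandas as pd

n_folds = 5
out_dir = Path("sample_files") / f"{n_folds}_fold"
out_dir.mkdir(parents=True, exist_ok=True)

samples = pd.read_parquet("phenotypes.parquet").index.astype(str).to_numpy()

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility of the split
rng.shuffle(samples)
folds = np.array_split(samples, n_folds)

for fold, test_samples in enumerate(folds):
    test_set = set(test_samples)
    train_samples = [s for s in samples if s not in test_set]
    with open(out_dir / f"samples_train{fold}.pkl", "wb") as f:
        pickle.dump(train_samples, f)
    with open(out_dir / f"samples_test{fold}.pkl", "wb") as f:
        pickle.dump(list(test_samples), f)
```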
+ +### Config and sample files +For running 5-fold cross-validation include the following configuration in the config: +``` +cv_path: sample_files +n_folds: 5 +``` +Provide sample files structured as sample_files/5_fold/samples_{split}{fold}.pkl, where {split} represents train/test and {fold} is a number from 0 to 4. + +### Run the pipeline +```shell +cd experiment +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/cv_training/cv_training_association_testing.snakefile +``` + + ## Training and association testing with a combined pipeline To run the full pipeline from training through association testing, use the below procedure. This includes training and association testing dataset generation, deepRVAT model training, computation of burdens, regression and evaluation. @@ -131,7 +251,6 @@ snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_tes ``` #### Running with REGENIE -*TODO:* ```shell cd experiment snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_testing_regenie.snakefile diff --git a/docs/seed_gene_discovery.md b/docs/seed_gene_discovery.md index 038f4e4b..ccb80445 100644 --- a/docs/seed_gene_discovery.md +++ b/docs/seed_gene_discovery.md @@ -13,13 +13,10 @@ The experiment directory in addition requires to have the same input data as spe - `variants.parquet` - `phenotypes.parquet` -The `annotations.parquet` data frame should have the following columns: +The `annotations.parquet` data frame output by the annotation pipeline can be used (although only the `Consequence_missense_variant` and `MAF` column will be used). One has to add a column `is_plof` to indicate whether a variant should is a loss of function variant (1) or not (0). We recommend to set this to `1` if any of the VEP consequences `["splice_acceptor_variant", "splice_donor_variant", "frameshift_variant", "stop_gained", "stop_lost", "start_lost"]` -- id (variant id, **should be the index column**) -- gene_ids (list) gene(s) the variant is assigned to -- is_plof (binary, indicating if the variant is loss of function) -- Consequence_missense_variant: -- MAF: Maximum of the MAF in the UK Biobank cohort and in gnomAD release 3.0 (non-Finnish European population) can also be changed by using the --maf-column {maf_col_name} flag for the rule config and replacing MAF in the config.yaml with the {maf_col_name} but it must contain the string '_AF', '_MAF' OR '^MAF' + +The `annotations.parquet` dataframe, generated by the annotation pipeline, can be utilized. To indicate if a variant is a loss of function (pLoF) variant, a column `is_plof` has to be added with values 0 or 1. We recommend to set this to `1` if the variant has been classified as any of these VEP consequences `["splice_acceptor_variant", "splice_donor_variant", "frameshift_variant", "stop_gained", "stop_lost", "start_lost"]`. ## Configuration file From 7b8b0aa662ff2bf87e0e88fa6b8fccbaaf0ba24c Mon Sep 17 00:00:00 2001 From: Eva Holtkamp Date: Tue, 14 May 2024 21:00:12 +0200 Subject: [PATCH 08/15] update config --- docs/deeprvat.md | 22 +++++++++--------- docs/seed_gene_discovery.md | 45 +++++++++++++++++++++++++++++++------ 2 files changed, 49 insertions(+), 18 deletions(-) diff --git a/docs/deeprvat.md b/docs/deeprvat.md index b80c2b59..014c08bd 100644 --- a/docs/deeprvat.md +++ b/docs/deeprvat.md @@ -33,19 +33,19 @@ contains the configuration parameters for setting phenotypes, training data, mod Samples missing in `phenotypes.parquet` won't be used in DeepRVAT training/testing. 
The user must generate this file as it's not output by the preprocessing/annotation pipeline. This file must also contain all covariates that should be used during training/association testing (e.g., genetic sex, age, genetic principal components). -- `[path_to_deeprvat]/example/baseline_results` +- `baseline_results` directory containing the results of the seed gene discovery pipline. Insturctions [here](seed_gene_discovery.md) ## Configuration file: Common parameters -The `config.yaml` file located in your `experiment` directory contains the configuration parameters of key sections: phenotypes, baseline_results, training_data, and data. It also allows to set many other configurations, such as the variant annotations +The `config.yaml` file located in your `experiment` directory contains the configuration parameters of key sections: phenotypes, baseline_results, training_data, and data. It also allows to set many other configurations detailed below. `config['training_data']` contains the relevant specifications for the training dataset creation. `config['data']` contains the relevant specifications for the association dataset creation. ### Baseline results -Specifies paths to results from various seed gene discovery method runs (Burden/SKAT test with pLoF and missense variants). When using the seed gene discovery pipeline provided with this package, simply link the directory as 'baseline_results' in the experiment directory without any further changes. +`config['baseline_results']` specifies paths to results from the seed gene discovery pipeline (Burden/SKAT test with pLoF and missense variants). When using the seed gene discovery pipeline provided with this package, simply link the directory as 'baseline_results' in the experiment directory without any further changes. If you want to provide custom baseline results (already combined across tests), store them like `baseline_results/{phenotype}/combined_baseline/eval/burden_associations.parquet` and set the `baseline_results` in the config to ``` @@ -58,7 +58,7 @@ Baseline files have to be provided for each `{phenotype}` in `config['training'] ### Phenotypes -`config['phenotypes]` should consist of a complete list of phenotypes. To adjust only those phenotypes that should be used in training, add the phenotype names as a list under `config['training']['phenotypes']`. The phenotypes that are not listed under `config['training']['phenotypes']`, but are listed under +`config['phenotypes]` should consist of a complete list of phenotypes. To change phenotypes used during training, use `config['training']['phenotypes']`. The phenotypes that are not listed under `config['training']['phenotypes']`, but are listed under `config['phenotypes]` will subsequently be used only for association testing. All phenotypes listed either in `config['phenotypes]` or `config['training']['phenotypes']` have to be in the column names of `phenotypes.parquet`. @@ -78,13 +78,13 @@ All variant anntations that should be included in DeepRVAT's variant annotation #### Variant minor allele frequency filter -To set a threshold for variants with a MAF below a certain value (e.g., UKB_MAF < 0.1%), use: +To filter for variants with a MAF below a certain value (e.g., UKB_MAF < 0.1%), use: `config[key]['dataset_config']['rare_embedding']['config']['thresholds']['UKB_MAF'] = "UKB_MAF < 1e-3"`. In this example, `UKB_MAF` represents the MAF column from annotations.parquet here denoting MAF in the UK Biobank. 
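Related to the filter above: the configuration only works if every column referenced in `thresholds` is also listed under the dataset's `annotations` and actually exists in `annotations.parquet`. A small sketch of that consistency check, assuming PyYAML and pyarrow are available and checking the association testing section (`data`); the `training_data` section can be checked analogously.

```python
# Sketch: check that every annotation used for filtering in `thresholds` is also
# listed under dataset_config['annotations'] and exists as a column in annotations.parquet.
import pyarrow.parquet as pq
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

dataset_config = config["data"]["dataset_config"]
listed = set(dataset_config.get("annotations", []))
thresholds = dataset_config["rare_embedding"]["config"].get("thresholds", {})

# Read only the parquet schema, which is cheap even for large annotation files.
available = set(pq.read_schema("annotations.parquet").names)

for anno in thresholds:
    if anno not in listed:
        print(f"{anno}: used in thresholds but missing from dataset_config['annotations']")
    if anno not in available:
        print(f"{anno}: not found as a column in annotations.parquet")
```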
#### Additional variant filters Additional variant filters can be added via `config[key]['dataset_config']['rare_embedding']['config']['thresholds'][{anno}] = "{anno} > X"`.For example `config['data]['dataset_config']['rare_embedding']['config']['thresholds']['CADD_PHRED'] = "CADD_PHRED > 5"` will only include variants with a CADD score > 5 during association testing. Mind that all annotations used in the `threshold` section also have to be listed in `config[key]['dataset_config']['annotations']`. -#### subsetting samples +#### Subsetting samples To specify a sample file for training or association testing, use: `config[key]['dataset_config']['sample_file]`. Only `.pkl` files containing a list of sample IDs (string) are supported at the moment. For example, if DeepRVAT training and association testing should be done on two separat datas sets, you can provide two sample files `training_samples.pkl` and `test_samples.pkl` via `config['training_data']['dataset_config']['sample_file] = training_samples.pkl` and `config['data']['dataset_config']['sample_file] = test_samples.pkl`. @@ -133,7 +133,7 @@ If you use the pre-trained DeepRVAT model provided with this package, use `confi ### Running the association testing pipeline with REGENIE #### Input data -For running with REGENIE, in addition the input data, the following REGENIE specific files should also be included in your `experiment` directory: +For running with REGENIE, in addition to the default input data, the following REGENIE specific files should also be included in your `experiment` directory: To run REGENIE Step 1 @@ -179,7 +179,7 @@ snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pret #### Testing multiple sub-chohorts -For testing multiple sub-cohorts, remember that REGENIE Step 1 (compute intense) only needs to be executed once per sample and phenotype. We suggest running REGENIE Step 1 on all samples and phenotypes initially and then linking the output as regenie_output/step1/ in each experiment directory for testing a sub-cohort. +For testing multiple sub-cohorts, remember that REGENIE Step 1 (compute intense) only needs to be executed once per sample and phenotype. We suggest running REGENIE Step 1 on all samples and phenotypes initially and then linking the output as `regenie_output/step1/` in each experiment directory for testing a sub-cohort. Samples to be considered when testing sub-cohorts can be provided via `keep_samples.txt` which look like @@ -202,7 +202,7 @@ snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pret ## Training and association testing using cross-validation -DeepRVAT offers a cv-like scheme, where it's trained on all samples except those in the held-out fold. Then, it computes gene impairment scores for the held-out samples using models that excluded them. This process repeats for all folds, providing DeepRVAT scores for all samples from models that hadn't seen them before. This is repeated for all folds, yielding DeepRVAT scores for all samples. +DeepRVAT offers a cv-like scheme, where it's trained on all samples except those in the held-out fold. Then, it computes gene impairment scores for the held-out samples using models that excluded them. This is repeated for all folds, yielding DeepRVAT scores for all samples. 
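Since this scheme only works as described if every sample lands in exactly one held-out fold, it may be worth verifying that property for your sample files. A small sketch, assuming the `sample_files/5_fold/samples_test{fold}.pkl` layout described further below:

```python
# Sketch: verify that the held-out (test) folds partition the samples, so that every
# sample receives a DeepRVAT score from models that never saw it during training.
import pickle
from collections import Counter

n_folds = 5
counts = Counter()
for fold in range(n_folds):
    with open(f"sample_files/5_fold/samples_test{fold}.pkl", "rb") as f:
        counts.update(pickle.load(f))

duplicates = [s for s, c in counts.items() if c > 1]
print(f"{len(counts)} unique samples across {n_folds} test folds")
print("samples appearing in more than one test fold:", len(duplicates))
```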
### Input data and configuration file The following files should be contained within your `experiment` directory: @@ -221,7 +221,7 @@ For running 5-fold cross-validation include the following configuration in the c cv_path: sample_files n_folds: 5 ``` -Provide sample files structured as sample_files/5_fold/samples_{split}{fold}.pkl, where {split} represents train/test and {fold} is a number from 0 to 4. +Provide sample files structured as `sample_files/5_fold/samples_{split}{fold}.pkl`, where `{split}` represents train/test and `{fold}` is a number from `0 to 4`. ### Run the pipeline ```shell @@ -231,7 +231,7 @@ snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/cv_training/cv_training_ ## Training and association testing with a combined pipeline -To run the full pipeline from training through association testing, use the below procedure. This includes training and association testing dataset generation, deepRVAT model training, computation of burdens, regression and evaluation. +To run the full pipeline from training through association testing, use the below procedure. This includes training and association testing dataset generation, DeepRVAT model training, computation of burdens, regression and evaluation. ### Input data and configuration file The following files should be contained within your `experiment` directory: diff --git a/docs/seed_gene_discovery.md b/docs/seed_gene_discovery.md index ccb80445..32282ec4 100644 --- a/docs/seed_gene_discovery.md +++ b/docs/seed_gene_discovery.md @@ -1,8 +1,6 @@ # Seed gene discovery -This pipeline discovers *seed genes* for DeepRVAT training. The pipeline runs SKAT and burden tests for missense and pLOF variants, weighting variants with Beta(MAF,1,25). To run the tests, we use the `Scoretest` from the [SEAK](https://github.com/HealthML/seak) package (has to be installed from github). - -To run the pipeline, an experiment directory with the `config.yaml` has to be created. An `lsf.yaml` file specifiying the compute resources for each rule in `seed_gene_discovery.snakefile` might also be needed depending on your system (see as an example the `lsf.yaml` file in this directory). +This pipeline discovers *seed genes* for DeepRVAT training. The pipeline runs SKAT and burden tests for missense and pLOF variants, weighting variants with Beta(MAF,1,25). To run the tests, we use the `Scoretest` from the [SEAK](https://github.com/HealthML/seak) package. ## Input data @@ -12,16 +10,49 @@ The experiment directory in addition requires to have the same input data as spe - `genotypes.h5` - `variants.parquet` - `phenotypes.parquet` - -The `annotations.parquet` data frame output by the annotation pipeline can be used (although only the `Consequence_missense_variant` and `MAF` column will be used). One has to add a column `is_plof` to indicate whether a variant should is a loss of function variant (1) or not (0). We recommend to set this to `1` if any of the VEP consequences `["splice_acceptor_variant", "splice_donor_variant", "frameshift_variant", "stop_gained", "stop_lost", "start_lost"]` - +- `config.yaml` (use `[path_to_deeprvat]/deeprvat/seed_gene_discovery/config.yaml` as a template) The `annotations.parquet` dataframe, generated by the annotation pipeline, can be utilized. To indicate if a variant is a loss of function (pLoF) variant, a column `is_plof` has to be added with values 0 or 1. 
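One way such a column might be derived is sketched below; the consequence set it uses is the one recommended in the following sentence, and the one-indicator-column-per-consequence naming (as with `Consequence_missense_variant`) is an assumption about the annotation output, so adjust the column names if yours differ.

```python
# Sketch: add an is_plof column (0/1) to annotations.parquet.
# Assumes the annotation pipeline wrote one indicator column per VEP consequence
# (as it does for Consequence_missense_variant); adjust column names if needed.
import pandas as pd

plof_consequences = [
    "splice_acceptor_variant",
    "splice_donor_variant",
    "frameshift_variant",
    "stop_gained",
    "stop_lost",
    "start_lost",
]

annotations = pd.read_parquet("annotations.parquet")

plof_columns = [
    f"Consequence_{c}"
    for c in plof_consequences
    if f"Consequence_{c}" in annotations.columns
]
annotations["is_plof"] = (annotations[plof_columns].sum(axis=1) > 0).astype(int)

# Overwrite in place, since the seed gene discovery pipeline expects the column here.
annotations.to_parquet("annotations.parquet")
```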
We recommend to set this to `1` if the variant has been classified as any of these VEP consequences `["splice_acceptor_variant", "splice_donor_variant", "frameshift_variant", "stop_gained", "stop_lost", "start_lost"]`.

 ## Configuration file

-*TODO:* Describe `config.yaml`, give example
+You can restrict to only missense variants (identified by the `Consequence_missense_variant` column in `annotations.parquet` ) or pLoF variants (`is_plof` column) via
+```
+variant_types:
+  - missense
+  - plof
+```
+and specify the test types that will be run via
+```
+test_types:
+  - skat
+  - burden
+```
+.
+
+The minor allele frequency threshold is set via
+```
+rare_maf: 0.001
+```
+
+You can specify further test details in the test config using the following parameters:
+
+- `center_genotype` center the genotype matrix (True or False)
+- `neglect_homozygous` Should the genotype value for homozyoogus variants be 1 (True) or 2 (False)
+- `collapse_method` Burden test collapsing method. Supported are `sum` and `max`
+- `var_weight` Variant weighting function. Supported are `beta_maf` (Beta(MAF, 1, 25)) or `sift_polpyen` (mean of 1-SIFT and Polyphen2 score)
+- `min_mac` minimum expected allele count for genes to be included. This is the cumulative allele frequency of variants in the burden mask (e.g., pLoF variants) for a given gene (e.g. pLoF variants) multiplied by the cohort size or number of cases for quantitative and binary traits, respectively.
+
+```
+test_config:
+  center_genotype: True
+  neglect_homozygous: False
+  collapse_method: sum #collapsing method for burden,
+  var_weight_function: beta_maf
+  min_mac: 50 # minimum expected allel count
+
+```

 ## Running the seed gene discovery pipeline

From 30891e9cb1866d34d21b47f7114a161ca7b6e031 Mon Sep 17 00:00:00 2001
From: Eva Holtkamp
Date: Tue, 14 May 2024 21:05:31 +0200
Subject: [PATCH 09/15] fix references

---
 docs/index.rst              | 2 +-
 docs/seed_gene_discovery.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/index.rst b/docs/index.rst
index ea9f40da..5cf9044a 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -20,7 +20,7 @@ Rare variant association testing using deep learning and data-driven burden scor
    annotations.md
    seed_gene_discovery.md
    deeprvat.md
-   ukbiobank.md
+   .. ukbiobank.md

    apidocs/index

diff --git a/docs/seed_gene_discovery.md b/docs/seed_gene_discovery.md
index 32282ec4..d60d6021 100644
--- a/docs/seed_gene_discovery.md
+++ b/docs/seed_gene_discovery.md
@@ -56,7 +56,7 @@ test_config:

 ## Running the seed gene discovery pipeline

-In a directory with all of the [input data](#input-data) required and your [configuration file](#configuration-file) set up, run:
+In a directory with all of the [input data](##input-data) required and your [configuration file](##configuration-file) set up, run:

 ```
 [path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile

From 3b533b7ada64c4216d15e968fb961e6979a0b5c2 Mon Sep 17 00:00:00 2001
From: Eva Holtkamp
Date: Tue, 14 May 2024 21:07:59 +0200
Subject: [PATCH 10/15] adding missing ukbiobank.md

---
 docs/index.rst    | 2 +-
 docs/ukbiobank.md | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)
 create mode 100644 docs/ukbiobank.md

diff --git a/docs/index.rst b/docs/index.rst
index 5cf9044a..ea9f40da 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -20,7 +20,7 @@ Rare variant association testing using deep learning and data-driven burden scor
    annotations.md
    seed_gene_discovery.md
    deeprvat.md
-   .. ukbiobank.md
+   ukbiobank.md

    apidocs/index

diff --git a/docs/ukbiobank.md b/docs/ukbiobank.md
new file mode 100644
index 00000000..3a81ad8e
--- /dev/null
+++ b/docs/ukbiobank.md
@@ -0,0 +1,3 @@
+# Applying DeepRVAT to UK Biobank data
+
+*TODO*
\ No newline at end of file

From c7e9dae339d69760874926215c7fd8a907139e1b Mon Sep 17 00:00:00 2001
From: Magnus Wahlberg
Date: Wed, 15 May 2024 14:56:59 +0200
Subject: [PATCH 11/15] Fix missing cross-reference targets

---
 docs/seed_gene_discovery.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/seed_gene_discovery.md b/docs/seed_gene_discovery.md
index d60d6021..138019f2 100644
--- a/docs/seed_gene_discovery.md
+++ b/docs/seed_gene_discovery.md
@@ -2,6 +2,7 @@

 This pipeline discovers *seed genes* for DeepRVAT training. The pipeline runs SKAT and burden tests for missense and pLOF variants, weighting variants with Beta(MAF,1,25). To run the tests, we use the `Scoretest` from the [SEAK](https://github.com/HealthML/seak) package.

+(input-data)=
 ## Input data

 The experiment directory in addition requires to have the same input data as specified for [DeepRVAT](deeprvat.md), including
@@ -14,6 +15,7 @@
 The `annotations.parquet` dataframe, generated by the annotation pipeline, can be utilized. To indicate if a variant is a loss of function (pLoF) variant, a column `is_plof` has to be added with values 0 or 1. We recommend to set this to `1` if the variant has been classified as any of these VEP consequences `["splice_acceptor_variant", "splice_donor_variant", "frameshift_variant", "stop_gained", "stop_lost", "start_lost"]`.

+(configuration-file)=
 ## Configuration file

 You can restrict to only missense variants (identified by the `Consequence_missense_variant` column in `annotations.parquet` ) or pLoF variants (`is_plof` column) via
@@ -56,7 +58,7 @@ test_config:

 ## Running the seed gene discovery pipeline

-In a directory with all of the [input data](##input-data) required and your [configuration file](##configuration-file) set up, run:
+In a directory with all of the [input data](#input-data) required and your [configuration file](#configuration-file) set up, run:

 ```
 [path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile

From 2ffe0b235b528326601f8180b623da290fb7df59 Mon Sep 17 00:00:00 2001
From: Magnus Wahlberg
Date: Wed, 15 May 2024 15:08:23 +0200
Subject: [PATCH 12/15] typo

---
 docs/seed_gene_discovery.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/seed_gene_discovery.md b/docs/seed_gene_discovery.md
index 138019f2..cd38f834 100644
--- a/docs/seed_gene_discovery.md
+++ b/docs/seed_gene_discovery.md
@@ -58,7 +58,7 @@ test_config:

 ## Running the seed gene discovery pipeline

-In a directory with all of the [input data](#input-data) required and your [configuration file](#configuration-file) set up, run:
+In a directory with all the [input data](#input-data) required and your [configuration file](#configuration-file) set up, run:

 ```
 [path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile

From 4cb1a4e767b545ade8b4b2011267d6eb4a0dc370 Mon Sep 17 00:00:00 2001
From: Brian Clarke
Date: Thu, 16 May 2024 16:20:58 +0200
Subject: [PATCH 13/15] various updates

---
 docs/deeprvat.md            | 100 +++++++++++++++++++++---------------
 docs/index.rst              |  41 ++++++++++++++-
 docs/installation.md        |   2 +-
 docs/preprocessing.md       |   6 +--
 docs/quickstart.md          |  40 ++++++++-------
 docs/seed_gene_discovery.md |   5 +-
 docs/ukbiobank.md           |   8 ++-
 7 files changed, 133 insertions(+), 69 deletions(-)

diff --git a/docs/deeprvat.md b/docs/deeprvat.md
index 014c08bd..a15656fb 100644
--- a/docs/deeprvat.md
+++ b/docs/deeprvat.md
@@ -1,11 +1,11 @@
 # Training and association testing with DeepRVAT

-We have developed multiple flavors of running DeepRVAT to suite your needs. Below lists various running setups that entail just training DeepRVAT, using pretrained DeepRVAT models for association testing, using precomputed burdens for association testing, including REGENIE in training and association testing and also combinations of these scenarios. The general procedure is to have the relevant input data for a given setup appropriately prepared, which may include having already completed the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html) and [annotation pipeline](https://deeprvat.readthedocs.io/en/latest/annotations.html).
+We have developed multiple modes of running DeepRVAT to suit your needs. Below are listed various running setups that entail just training DeepRVAT, using pretrained DeepRVAT models for association testing, using precomputed burdens for association testing, including REGENIE in training and association testing and also combinations of these scenarios. The general procedure is to have the relevant input data for a given setup appropriately prepared, which may include having already completed the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html) and [annotation pipeline](https://deeprvat.readthedocs.io/en/latest/annotations.html).

-## Installation
-First the deepRVAT repository must be cloned in your `experiment` directory and the corresponding environment activated. Instructions are [here](installation.md) to setup the deepRVAT repository.

+(common requirements for input data)=
 ## Input data: Common requirements for all pipelines
+
 An example overview of what your `experiment` directory should contain can be seen here: `[path_to_deeprvat]/example/`

 Replace `[path_to_deeprvat]` with the path to your clone of the repository.
 Note that the example data contained within the example directory is randomly generated, and is only suited for testing.

 - `genotypes.h5`
-contains the genotypes for all samples in a custom sparse format. The sample ids in the `sample` slot are the same as in the VCF files the `genotypes.h5` has been read from.
+contains the genotypes for all samples in a custom sparse format. The sample ids in the `samples` dataset are the same as in the VCF files the `genotypes.h5` has been read from.
 This is output by the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html).
 - `variants.parquet`
 contains variant characteristics (`chrom`, `pos`, `ref`, `alt`) and the assigned variant `id` for all unique variants in `genotypes.h5`. This
-is output from the input vcf files using preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html).
+is output from the input VCF files using the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html).
 - `annotations.parquet`
 contains the variant annotations for all variants in `variants.parquet`, which is an output from the annotation pipeline. Each variant is identified by its `id`. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html).
 - `protein_coding_genes.parquet`
-Maps the `gene_id` used in `annotations.parquet` to actual genes (EnsemblID and HGNC gene name). This is an output from the annotation pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html).
+Maps the integer `gene_id` used in `annotations.parquet` to standard gene IDs (EnsemblID and HGNC gene name). This is an output from the annotation pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html).
 - `config.yaml`
 contains the configuration parameters for setting phenotypes, training data, model, training, and association data variables.
@@ -36,6 +36,8 @@ This file must also contain all covariates that should be used during training/a
 - `baseline_results`
 directory containing the results of the seed gene discovery pipline. Insturctions [here](seed_gene_discovery.md)
+
+(common configuration parameters)=
 ## Configuration file: Common parameters

 The `config.yaml` file located in your `experiment` directory contains the configuration parameters of key sections: phenotypes, baseline_results, training_data, and data. It also allows to set many other configurations detailed below.
@@ -54,13 +56,13 @@ If you want to provide custom baseline results (already combined across tests),
 ```

 Baseline files have to be provided for each `{phenotype}` in `config['training']['phenotypes']`. The `burden_associations.parquet` must have the columns `gene` (gene id as assigned in `protein_coding_genes.parquet`) and `pval` (see `[path_to_deeprvat]/example/baseline_results`).

-*TODO* add that seed gene config can be set via the `config['phenotypes]`
+
 ### Phenotypes
 `config['phenotypes]` should consist of a complete list of phenotypes. To change phenotypes used during training, use `config['training']['phenotypes']`.
 The phenotypes that are not listed under `config['training']['phenotypes']`, but are listed under `config['phenotypes]` will subsequently be used only for association testing.
-All phenotypes listed either in `config['phenotypes]` or `config['training']['phenotypes']` have to be in the column names of `phenotypes.parquet`.
+All phenotypes listed either in `config['phenotypes']` or `config['training']['phenotypes']` have to be in the column names of `phenotypes.parquet`.

 ### Customizing the input data via the config file
@@ -79,41 +81,25 @@ All variant anntations that should be included in DeepRVAT's variant annotation
 #### Variant minor allele frequency filter
 To filter for variants with a MAF below a certain value (e.g., UKB_MAF < 0.1%), use:
-`config[key]['dataset_config']['rare_embedding']['config']['thresholds']['UKB_MAF'] = "UKB_MAF < 1e-3"`. In this example, `UKB_MAF` represents the MAF column from annotations.parquet here denoting MAF in the UK Biobank.
+`config[key]['dataset_config']['rare_embedding']['config']['thresholds']['UKB_MAF'] = "UKB_MAF < 1e-3"`. In this example, `UKB_MAF` represents the MAF column from `annotations.parquet` here denoting MAF in the UK Biobank.

 #### Additional variant filters
-Additional variant filters can be added via `config[key]['dataset_config']['rare_embedding']['config']['thresholds'][{anno}] = "{anno} > X"`.For example `config['data]['dataset_config']['rare_embedding']['config']['thresholds']['CADD_PHRED'] = "CADD_PHRED > 5"` will only include variants with a CADD score > 5 during association testing. Mind that all annotations used in the `threshold` section also have to be listed in `config[key]['dataset_config']['annotations']`.
+Additional variant filters can be added via `config[key]['dataset_config']['rare_embedding']['config']['thresholds'][{anno}] = "{anno} > X"`. For example, `config['data']['dataset_config']['rare_embedding']['config']['thresholds']['CADD_PHRED'] = "CADD_PHRED > 5"` will only include variants with a CADD score > 5 during association testing. Mind that all annotations used in the `threshold` section also have to be listed in `config[key]['dataset_config']['annotations']`.

 #### Subsetting samples
-To specify a sample file for training or association testing, use: `config[key]['dataset_config']['sample_file]`.
+To specify a sample file for training or association testing, use: `config[key]['dataset_config']['sample_file']`.
 Only `.pkl` files containing a list of sample IDs (string) are supported at the moment.
-For example, if DeepRVAT training and association testing should be done on two separat datas sets, you can provide two sample files `training_samples.pkl` and `test_samples.pkl` via `config['training_data']['dataset_config']['sample_file] = training_samples.pkl` and `config['data']['dataset_config']['sample_file] = test_samples.pkl`.
+For example, if DeepRVAT training and association testing should be done on two separate data sets, you can provide two sample files `training_samples.pkl` and `test_samples.pkl` via `config['training_data']['dataset_config']['sample_file'] = training_samples.pkl` and `config['data']['dataset_config']['sample_file'] = test_samples.pkl`.

-## Training
-To run only the training stage of DeepRVAT, comprised of training data creation and running the DeepRVAT model, we have setup a training pipeline.
-
-### Input data
-The following files should be contained within your `experiment` directory:
-- `config.yaml`
-- `genotypes.h5`
-- `variants.parquet`
-- `annotations.parquet`
-- `phenotypes.parquet`
-- `protein_coding_genes.parquet`
-- `baseline_results` directory where `[path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile` has been run
+## Association testing using precomputed burdens

-### Configuration file
-Changes to the model architecture and training parameters can be made via `config['training']`, `config['pl_trainer']`, `config['early_stopping']`, `config['model']`.
-Per default, DeepRVAT scores are ensembled from 6 models. This can be changed via `config['n_repeats']`.
+_Coming soon_
+

-### Running the training pipeline
-```shell
-cd experiment
-snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile
-```
+(Association_testing)=
+## Association testing using pretrained models

-## Association testing
 If you already have a pretrained DeepRVAT model, we have setup pipelines for runing only the association testing stage. This includes creating the association dataset files, computing burdens, regression, and evaluation.

 ### Input data

@@ -132,6 +118,10 @@ If you use the pre-trained DeepRVAT model provided with this package, use `confi
 ### Running the association testing pipeline with REGENIE

+_Coming soon_
+
+
+## Training
+To run only the training stage of DeepRVAT, comprised of training data creation and running the DeepRVAT model, we have setup a training pipeline.
+
+### Input data
+The following files should be contained within your `experiment` directory:
+- `config.yaml`
+- `genotypes.h5`
+- `variants.parquet`
+- `annotations.parquet`
+- `phenotypes.parquet`
+- `protein_coding_genes.parquet`
+- `baseline_results` directory where `[path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile` has been run
+
+### Configuration file
+Changes to the model architecture and training parameters can be made via `config['training']`, `config['pl_trainer']`, `config['early_stopping']`, `config['model']`.
+Per default, DeepRVAT scores are ensembled from 6 models. This can be changed via `config['n_repeats']`.
+
+
+### Running the training pipeline
+```shell
+cd experiment
+snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile
+```
+
 ## Training and association testing using cross-validation
-DeepRVAT offers a cv-like scheme, where it's trained on all samples except those in the held-out fold. Then, it computes gene impairment scores for the held-out samples using models that excluded them. This is repeated for all folds, yielding DeepRVAT scores for all samples.
+DeepRVAT offers a CV scheme, where it's trained on all samples except those in the held-out fold. Then, it computes gene impairment scores for the held-out samples using models that excluded them. This is repeated for all folds, yielding DeepRVAT scores for all samples.

 ### Input data and configuration file
 The following files should be contained within your `experiment` directory:
@@ -229,9 +243,10 @@ cd experiment
 snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/cv_training/cv_training_association_testing.snakefile
 ```

+
+
 ## Running only a portion of any pipeline

 The snakemake pipelines outlined above are compromised of integrated common workflows. These smaller snakefiles which breakdown specific pipelines sections are in the following directories:
-- `[path_to_deeprvat]/pipeline/association_testing` contains snakefiles breakingdown stages of the association testing.
+- `[path_to_deeprvat]/pipeline/association_testing` contains snakefiles breaking down stages of the association testing.
 - `[path_to_deeprvat]/pipeline/cv_training` contains snakefiles used to run training in a cross-validation setup.
 - `[path_to_deeprvat]/pipeline/training` contains snakefiles used in setting up deepRVAT training.

diff --git a/docs/index.rst b/docs/index.rst
index ea9f40da..81a226ce 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -4,9 +4,45 @@ contain the root `toctree` directive.

 Welcome to DeepRVAT's documentation!
+======================================
+
+Rare variant association testing using deep learning and data-driven burden scores.
+
+
+How to use this documentation
+===================================
+
+A good place to start is in :doc:`Basic usage <quickstart>`, to install the package and make sure it runs correctly.
+
+To run DeepRVAT on your data, first consult *Modes of usage* :doc:`here <practical>`, then proceed based on which mode is right for your use case.
+
+For all modes, you'll want to consult *Input data: Common requirements for all pipelines* and *Configuration file: Common parameters* :doc:`here <deeprvat>`.
+
+For all modes of usage other than association testing with precomputed burdens, you'll need to :doc:`preprocess <preprocessing>` your genotype data, followed by :doc:`annotating <annotations>` your variants.
+
+To train custom DeepRVAT models, rather than using precomputed burdens or our provided pretrained models, you'll need to additionally run :doc:`seed gene discovery <seed_gene_discovery>`.
+
+Finally, consult the relevant section for your use case :doc:`here <deeprvat>`.
+
+If running DeepRVAT on a cluster (recommended), some helpful tips are :doc:`here <cluster>`.
+
+
+Citation
 ====================================
-Rare variant association testing using deep learning and data-driven burden scores
+If you use this package, please cite:
+
+Clarke, Holtkamp et al., “Integration of Variant Annotations Using Deep Set Networks Boosts Rare Variant Association Genetics.” bioRxiv. https://dx.doi.org/10.1101/2023.07.12.548506
+
+
+Contact
+====================================
+
+To report a bug or make a feature request, please create an `issue <https://github.com/PMBio/deeprvat/issues>`_ on GitHub.
+
+| For general inquiries, please contact:
+| brian.clarke@dkfz.de
+| eva.holtkamp@cit.tum.de

 .. toctree::
@@ -15,11 +51,12 @@
    installation.md
    quickstart.md
-   general.md
    preprocessing.md
    annotations.md
    seed_gene_discovery.md
    deeprvat.md
+   cluster.md
+   practical.md
    ukbiobank.md

    apidocs/index

diff --git a/docs/installation.md b/docs/installation.md
index a3e61d8c..66c2f91c 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -7,7 +7,7 @@ git clone git@github.com:PMBio/deeprvat.git
 1. Change directory to the repository: `cd deeprvat`
 1. Install the conda environment. We recommend using [mamba](https://mamba.readthedocs.io/en/latest/index.html), though you may also replace `mamba` with `conda`

-   *Note: [the current deeprvat env does not support cuda when installed with conda](https://github.com/PMBio/deeprvat/issues/16), install using mamba for cuda support.*
+   *Note: [the current deeprvat env does not support cuda when installed with conda](https://github.com/PMBio/deeprvat/issues/16). Install using mamba for cuda support.*
 ```shell
 mamba env create -n deeprvat -f deeprvat_env.yaml
 ```

diff --git a/docs/preprocessing.md b/docs/preprocessing.md
index c9c3bc79..0457d9f8 100644
--- a/docs/preprocessing.md
+++ b/docs/preprocessing.md
@@ -1,7 +1,7 @@
-# DeepRVAT Preprocessing pipeline
+# DeepRVAT preprocessing pipeline

-The DeepRVAT preprocessing pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/) it uses
-[bcftools+samstools](https://www.htslib.org/) and a [python script](https://github.com/PMBio/deeprvat/blob/main/deeprvat/preprocessing/preprocess.py) preprocessing.py.
+The DeepRVAT preprocessing pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses
+[bcftools+samtools](https://www.htslib.org/) and a [python script](https://github.com/PMBio/deeprvat/blob/main/deeprvat/preprocessing/preprocess.py) preprocessing.py.

 ![DeepRVAT preprocessing pipeline](_static/preprocess_no_qc_rulegraph.svg)

diff --git a/docs/quickstart.md b/docs/quickstart.md
index dec65a71..cbc798f1 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -1,5 +1,9 @@
 # Basic usage

+## Install the package
+
+Instructions [here](installation.md)
+
 ## Customize pipelines

 Before running any of the snakefiles, you may want to adjust the number of threads used by different steps in the pipeline. To do this, modify the `threads:` property of a given rule.

 If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles) and may need to add `resources:` directives to the snakefiles.
@@ -24,43 +28,43 @@ In each case, replace `[path_to_deeprvat]` with the path to your clone of the re
 Note that the example data used here is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed.

-### Run the full training and association testing pipeline on some example data
+### Run the association testing pipeline with pretrained models

 ```shell
-mkdir deeprvat_train_associate
-cd deeprvat_train_associate
+mkdir deeprvat_associate
+cd deeprvat_associate
 ln -s [path_to_deeprvat]/example/* .
-snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_testing.snakefile
+ln -s [path_to_deeprvat]/pretrained_models
+snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained.snakefile
 ```

-### Run the training pipeline on some example data
+### Run association testing using REGENIE on precomputed burdens

 ```shell
-mkdir deeprvat_train
-cd deeprvat_train
+mkdir deeprvat_associate_regenie
+cd deeprvat_associate_regenie
 ln -s [path_to_deeprvat]/example/* .
-snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile
+ln -s precomputed_burdens/burdens.zarr .
+snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained_regenie.snakefile
 ```

-### Run the association testing pipeline with pretrained models
+### Run the training pipeline on some example data

 ```shell
-mkdir deeprvat_associate
-cd deeprvat_associate
+mkdir deeprvat_train
+cd deeprvat_train
 ln -s [path_to_deeprvat]/example/* .
-ln -s [path_to_deeprvat]/pretrained_models
-snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained.snakefile
+snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile
 ```

-### Run association testing using REGENIE on precomputed burdens
+### Run the full training and association testing pipeline on some example data

 ```shell
-mkdir deeprvat_associate_regenie
-cd deeprvat_associate_regenie
+mkdir deeprvat_train_associate
+cd deeprvat_train_associate
 ln -s [path_to_deeprvat]/example/* .
-ln -s precomputed_burdens/burdens.zarr .
-snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained_regenie.snakefile
+snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_testing.snakefile
 ```

diff --git a/docs/seed_gene_discovery.md b/docs/seed_gene_discovery.md
index 257bfbe9..3fb8ca97 100644
--- a/docs/seed_gene_discovery.md
+++ b/docs/seed_gene_discovery.md
@@ -13,9 +13,9 @@ The experiment directory in addition requires to have the same input data as spe
 - `genotypes.h5`
 - `variants.parquet`
 - `phenotypes.parquet`
-- `config.yaml` (use `[path_to_deeprvat]/deeprvat/seed_gene_discovery/config.yaml` as a template)
+- `config.yaml` (use [this](https://github.com/PMBio/deeprvat/blob/main/deeprvat/seed_gene_discovery/config.yaml) as a template)

-The `annotations.parquet` dataframe, generated by the annotation pipeline, can be utilized. To indicate if a variant is a loss of function (pLoF) variant, a column `is_plof` has to be added with values 0 or 1. We recommend to set this to `1` if the variant has been classified as any of these VEP consequences `["splice_acceptor_variant", "splice_donor_variant", "frameshift_variant", "stop_gained", "stop_lost", "start_lost"]`.
+The `annotations.parquet` dataframe, generated by the annotation pipeline, can be utilized. To indicate if a variant is a loss of function (pLoF) variant, a column `is_plof` has to be added with values 0 or 1. We recommend to set this to `1` if the variant has been classified as any of these VEP consequences: `["splice_acceptor_variant", "splice_donor_variant", "frameshift_variant", "stop_gained", "stop_lost", "start_lost"]`.
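+
+A minimal sketch of how such a column could be added with pandas is shown below. This is for illustration only and assumes that `annotations.parquet` contains one 0/1 indicator column per VEP consequence, named `Consequence_<consequence>` (as it does for `Consequence_missense_variant`); adjust the column names if your annotation output differs.
+
+```python
+import pandas as pd
+
+# Consequences treated as pLoF; assumes indicator columns named "Consequence_<consequence>"
+plof_consequences = [
+    "splice_acceptor_variant", "splice_donor_variant", "frameshift_variant",
+    "stop_gained", "stop_lost", "start_lost",
+]
+
+annotations = pd.read_parquet("annotations.parquet")
+plof_cols = [f"Consequence_{c}" for c in plof_consequences]
+
+# is_plof is 1 if any of the listed consequences applies to the variant, else 0
+annotations["is_plof"] = (annotations[plof_cols].sum(axis=1) > 0).astype(int)
+annotations.to_parquet("annotations.parquet")
+```
+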
 (configuration-file)=
 ## Configuration file

 You can restrict to only missense variants (identified by the `Consequence_missense_variant` column in `annotations.parquet` ) or pLoF variants (`is_plof` column) via
 ```
 variant_types:
   - missense
   - plof
 ```
 and specify the test types that will be run via
 ```
 test_types:
   - skat
   - burden
 ```
-.

 The minor allele frequency threshold is set via

diff --git a/docs/ukbiobank.md b/docs/ukbiobank.md
index 3a81ad8e..7c4987f3 100644
--- a/docs/ukbiobank.md
+++ b/docs/ukbiobank.md
@@ -1,3 +1,9 @@
 # Applying DeepRVAT to UK Biobank data

-*TODO*
\ No newline at end of file
+_Note: This section is coming soon!_
+
+## First steps
+
+## Basic analysis: Using precomputed burdens
+
+## Advanced analysis: Custom-trained DeepRVAT model

From e09b8d5e3a1b9994ac78782d39ca879b9fbda70c Mon Sep 17 00:00:00 2001
From: Brian Clarke
Date: Thu, 16 May 2024 16:30:54 +0200
Subject: [PATCH 14/15] commit missing files

---
 docs/cluster.md   | 17 +++++++++++++++
 docs/general.md   | 18 ------------------
 docs/practical.md | 48 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 65 insertions(+), 18 deletions(-)
 create mode 100644 docs/cluster.md
 delete mode 100644 docs/general.md
 create mode 100644 docs/practical.md

diff --git a/docs/cluster.md b/docs/cluster.md
new file mode 100644
index 00000000..9b60a105
--- /dev/null
+++ b/docs/cluster.md
@@ -0,0 +1,17 @@
+# Cluster execution
+
+## Pipeline resource requirements
+
+For cluster execution, resource requirements are expected under `resources:` in all rules. All pipelines have some suggested resource requirements, but they may need to be adjusted for your data or cluster.
+
+
+## Cluster execution
+
+If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles). We have tested execution on LSF. If you run into issues running on other clusters, please [let us know](https://github.com/PMBio/deeprvat/issues).
+
+
+## Execution on GPU vs. CPU
+
+Two steps in the pipelines use GPU by default: Training (rule `train` from [train.snakefile](https://github.com/PMBio/deeprvat/blob/main/pipelines/training/train.snakefile)) and burden computation (rule `compute_burdens` from [burdens.snakefile](https://github.com/PMBio/deeprvat/blob/main/pipelines/association_testing/burdens.snakefile)). To run on CPU on a computing cluster, you may need to remove the line `gpus = 1` from the `resources:` of those rules.
+
+Bear in mind that this will make burden computation substantially slower, but still feasible for most datasets. Training without GPU is not practical on large datasets such as UK Biobank.

diff --git a/docs/general.md b/docs/general.md
deleted file mode 100644
index bcc6e71d..00000000
--- a/docs/general.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# General considerations
-
-## Pipeline resource requirements
-
-*TODO:* Note that pipelines have some suggested resource requirements, may need to be adjusted for cluster execution
-
-
-## Cluster execution
-
-*TODO:* Point to snakemake profiles
-
-
-## Execution on GPU vs. CPU
-
-*TODO:* Two rules that use GPU. Training pretty much requires GPU, burden computation is okay on CPU but substantially slower
-
-
-## *TODO:* Add any other points?

diff --git a/docs/practical.md b/docs/practical.md
new file mode 100644
index 00000000..d0caf180
--- /dev/null
+++ b/docs/practical.md
@@ -0,0 +1,48 @@
+# Practical recommendations for users
+
+
+## Modes of usage
+
+DeepRVAT can be applied in various modes, presented here in increasing levels of complexity. For each of these scenarios, we provide a corresponding Snakemake pipeline.
+
+### Precomputed burden scores
+
+_Note: Precomputed burden scores are not yet available. They will be made available upon publication of the DeepRVAT manuscript._
+
+For users running association testing on UKBB WES data, we provide precomputed burden scores for all protein-coding genes with a qualifying variant within 300 bp of an exon. In this scenario, users are freed from processing of large WES data and may carry out highly computationally efficient association tests with the default DeepRVAT pipeline or the DeepRVAT+REGENIE integration.
+
+Note that DeepRVAT scores are on a scale between 0 and 1, with a score closer to 0 indicating that the aggregate effect of variants in the gene is protective, and a score closer to 1 when the aggregate effect is deleterious.
+
+### Pretrained models
+
+Some users may wish to select variants or make variant-to-gene assigments differently from our methods, or to work on datasets other than UKBB. For this, we provide an ensemble of pretrained DeepRVAT gene impairment modules, which can be used for scoring individual-gene pairs for subsequent association testing. We also provide a pipeline for functional annotation of variants for compatibility with the pretrained modules.
+
+### Model training
+
+Other users may wish to exert full control over DeepRVAT scores, for example, to modify the model architecture, the set of annotations, or the set of training traits. For this, we provide pipelines for gene impairment module training, both in our CV and in a standard training/validation setup, with subsequent gene impairment score computation and association testing.
+
+
+## Gene impairment module training
+
+For users wishing to train a custom DeepRVAT model, we provide here some practical suggestions based on our experiences.
+
+### Model architecture
+
+We found no benefit to using architectures larger than that used in this work, though we conjecture that larger architectures may provide some benefit with larger training data and more annotations. We performed limited experimentation with the aggregation function used and found the maximum to give better results than the sum. However, exploring other choices or a learned aggregation remains open.
+
+### Training traits and seed genes
+
+We found that multiphenotype training improved performance, however, on our dataset, adding traits with fewer than three seed genes provided modest to no benefit. We also saw poor performance when including seed genes based on prior knowledge, e.g., known GWAS or RVAS associations, rather than the seed gene discovery methods. We hypothesize that this is because an informative seed gene must have driver rare variants in the training dataset itself, which may not be the case for associations known from other cohorts.
+
+### Variant selection
+
+While association testing was carried out on variants with MAF < 0.1%, we saw improved results when including a greater number of variants (we used MAF < 1%) for training.
+
+### Variant annotations
+
+We found that the best performance was achieved when including the full set of annotations, including correlated annotations. We thus recommend including annotations fairly liberally. However, we did find limits, for example, increasing the number of DeepSEA PCs from the 6 we used provided no benefit and eventually degraded model performance.
+
+### Model ensembling
+
+We found little to no benefit, but also no harm, from using more than 6 DeepRVAT gene impairment modules per CV fold in our ensemble. Therefore, we chose this number as the most computationally efficient to achieve optimal results.
+

From 8ae143298aec9a19a7fea82fa7c187d32d5f9025 Mon Sep 17 00:00:00 2001
From: Magnus Wahlberg
Date: Thu, 16 May 2024 16:42:22 +0200
Subject: [PATCH 15/15] Typos

---
 docs/annotations.md | 10 +++++-----
 docs/deeprvat.md    |  6 +++---
 docs/practical.md   |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/annotations.md b/docs/annotations.md
index 29e079fa..94957ca9 100644
--- a/docs/annotations.md
+++ b/docs/annotations.md
@@ -1,10 +1,10 @@
 # DeepRVAT Annotation pipeline

-This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samstools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computet using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target)
+This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samtools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computed using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target)

 ![dag](_static/annotation_rulegraph.svg)

-*Figure 1: Rulegraph of the annoation pipeline.*
+*Figure 1: Rule graph of the annotation pipeline.*

 ## Output
 This pipeline outputs a parquet file including all annotations as well as a file containing IDs to all protein coding genes needed to run DeepRVAT.

@@ -28,7 +28,7 @@ BCFtools as well as HTSlib should be installed on the machine,
 should be installed for runnning the pipeline, together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).
 Download paths:

-- [CADD](https://cadd.bihealth.org/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices
+- [CADD](https://cadd.bihealth.org/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices
 - [SpliceAI](https://basespace.illumina.com/s/otSPW8hnhaZR): "genome_scores_v1.3"/"spliceai_scores.raw.snv.hg38.vcf.gz" and "spliceai_scores.raw.indel.hg38.vcf.gz"
 - [PrimateAI](https://basespace.illumina.com/s/yYGFdGih1rXL) PrimateAI supplementary data/"PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz"
 - [AlphaMissense](https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz)

@@ -80,7 +80,7 @@ The config above would use the following directory structure:

 Bcf files created by the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html) are used as input data. The input data directory should only contain the files needed.
-The pipeline also uses the variant.tsv file, the reference file and the genotypes file from the preprocesing pipeline.
+The pipeline also uses the variant.tsv file, the reference file and the genotypes file from the preprocessing pipeline.
 A GTF file as described in [requirements](#requirements) and the FASTA file used for preprocessing is also necessary.
 The pipeline beginns by installing the repositories needed for the annotations, it will automatically install all repositories in the `repo_dir` folder that can be specified in the config file relative to the annotation working directory. The text file mapping blocks to chromosomes is stored in `metadata` folder. The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used including VEP with its corresponding cache as well as plugins are stored in `repo_dir/ensempl-vep`.

@@ -107,7 +107,7 @@ Data for VEP plugins and the CADD cache are stored in `annotation data`.

 ### Running the pipeline

-This pipeline should be run after running the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html), since it relies on some of its outpur files (specifically the bcf files in `norm/bcf/`, the variant files in `norm/variants/` and the genotype file `preprocessed/genotypes.h5`
+This pipeline should be run after running the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html), since it relies on some of its output files (specifically the bcf files in `norm/bcf/`, the variant files in `norm/variants/` and the genotype file `preprocessed/genotypes.h5`

 After configuration and activating the `deeprvat_annotations` environment run the pipeline using snakemake:

diff --git a/docs/deeprvat.md b/docs/deeprvat.md
index a15656fb..476563bf 100644
--- a/docs/deeprvat.md
+++ b/docs/deeprvat.md
@@ -100,7 +100,7 @@ _Coming soon_

 (Association_testing)=
 ## Association testing using pretrained models

-If you already have a pretrained DeepRVAT model, we have setup pipelines for runing only the association testing stage. This includes creating the association dataset files, computing burdens, regression, and evaluation.
+If you already have a pretrained DeepRVAT model, we have set up pipelines for running only the association testing stage. This includes creating the association dataset files, computing burdens, regression, and evaluation.

 ### Input data

@@ -190,7 +190,7 @@ snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pret
 --->

 ## Training
-To run only the training stage of DeepRVAT, comprised of training data creation and running the DeepRVAT model, we have setup a training pipeline.
+To run only the training stage of DeepRVAT, comprised of training data creation and running the DeepRVAT model, we have set up a training pipeline.
 ### Input data
 ### Configuration file

@@ -229,7 +229,7 @@ The following files should be contained within your `experiment` directory:
 - `protein_coding_genes.parquet`
 - `baseline_results` directory

-### Running the training and association testing pipelinewith SEAK
+### Running the training and association testing pipeline with SEAK

 ```shell
 cd experiment

diff --git a/docs/practical.md b/docs/practical.md
index d0caf180..0f827970 100644
--- a/docs/practical.md
+++ b/docs/practical.md
@@ -15,7 +15,7 @@ Note that DeepRVAT scores are on a scale between 0 and 1, with a score closer to

 ### Pretrained models

-Some users may wish to select variants or make variant-to-gene assigments differently from our methods, or to work on datasets other than UKBB. For this, we provide an ensemble of pretrained DeepRVAT gene impairment modules, which can be used for scoring individual-gene pairs for subsequent association testing. We also provide a pipeline for functional annotation of variants for compatibility with the pretrained modules.
+Some users may wish to select variants or make variant-to-gene assignments differently from our methods, or to work on datasets other than UKBB. For this, we provide an ensemble of pretrained DeepRVAT gene impairment modules, which can be used for scoring individual-gene pairs for subsequent association testing. We also provide a pipeline for functional annotation of variants for compatibility with the pretrained modules.

 ### Model training