diff --git a/docs/annotations.md b/docs/annotations.md index 29e079fa..94957ca9 100644 --- a/docs/annotations.md +++ b/docs/annotations.md @@ -1,10 +1,10 @@ # DeepRVAT Annotation pipeline -This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samstools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computet using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target) +This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samstools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computed using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target) ![dag](_static/annotation_rulegraph.svg) -*Figure 1: Rulegraph of the annoation pipeline.* +*Figure 1: Rule graph of the annotation pipeline.* ## Output This pipeline outputs a parquet file including all annotations as well as a file containing IDs to all protein coding genes needed to run DeepRVAT. @@ -28,7 +28,7 @@ BCFtools as well as HTSlib should be installed on the machine, should be installed for runnning the pipeline, together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml). Download paths: -- [CADD](https://cadd.bihealth.org/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices +- [CADD](https://cadd.bihealth.org/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices - [SpliceAI](https://basespace.illumina.com/s/otSPW8hnhaZR): "genome_scores_v1.3"/"spliceai_scores.raw.snv.hg38.vcf.gz" and "spliceai_scores.raw.indel.hg38.vcf.gz" - [PrimateAI](https://basespace.illumina.com/s/yYGFdGih1rXL) PrimateAI supplementary data/"PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz" - [AlphaMissense](https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz) @@ -80,7 +80,7 @@ The config above would use the following directory structure: Bcf files created by the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html) are used as input data. The input data directory should only contain the files needed. -The pipeline also uses the variant.tsv file, the reference file and the genotypes file from the preprocesing pipeline. +The pipeline also uses the variant.tsv file, the reference file and the genotypes file from the preprocessing pipeline. A GTF file as described in [requirements](#requirements) and the FASTA file used for preprocessing is also necessary. The pipeline beginns by installing the repositories needed for the annotations, it will automatically install all repositories in the `repo_dir` folder that can be specified in the config file relative to the annotation working directory. The text file mapping blocks to chromosomes is stored in `metadata` folder. The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used including VEP with its corresponding cache as well as plugins are stored in `repo_dir/ensempl-vep`. @@ -107,7 +107,7 @@ Data for VEP plugins and the CADD cache are stored in `annotation data`. ### Running the pipeline -This pipeline should be run after running the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html), since it relies on some of its outpur files (specifically the bcf files in `norm/bcf/`, the variant files in `norm/variants/` and the genotype file `preprocessed/genotypes.h5` +This pipeline should be run after running the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html), since it relies on some of its output files (specifically the bcf files in `norm/bcf/`, the variant files in `norm/variants/` and the genotype file `preprocessed/genotypes.h5` After configuration and activating the `deeprvat_annotations` environment run the pipeline using snakemake: diff --git a/docs/cluster.md b/docs/cluster.md new file mode 100644 index 00000000..9b60a105 --- /dev/null +++ b/docs/cluster.md @@ -0,0 +1,17 @@ +# Cluster execution + +## Pipeline resource requirements + +For cluster exectution, resource requirements are expected under `resources:` in all rules. All pipelines have some suggested resource requirements, but they may need to be adjusted for your data or cluster. + + +## Cluster execution + +If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles). We have tested execution on LSF. If you run into issues running on other clusters, please [let us know](https://github.com/PMBio/deeprvat/issues). + + +## Execution on GPU vs. CPU + +Two steps in the pipelines use GPU by default: Training (rule `train` from [train.snakefile](https://github.com/PMBio/deeprvat/blob/main/pipelines/training/train.snakefile)) and burden computation (rule `compute_burdens` from [burdens.snakefile](https://github.com/PMBio/deeprvat/blob/main/pipelines/association_testing/burdens.snakefile)). To run on CPU on a computing cluster, you may need to remove the line `gpus = 1` from the `resources:` of those rules. + +Bear in mind that this will make burden computation substantially slower, but still feasible for most datasets. Training without GPU is not practical on large datasets such as UK Biobank. diff --git a/docs/deeprvat.md b/docs/deeprvat.md new file mode 100644 index 00000000..476563bf --- /dev/null +++ b/docs/deeprvat.md @@ -0,0 +1,281 @@ +# Training and association testing with DeepRVAT + +We have developed multiple modes of running DeepRVAT to suit your needs. Below are listed various running setups that entail just training DeepRVAT, using pretrained DeepRVAT models for association testing, using precomputed burdens for association testing, including REGENIE in training and association testing and also combinations of these scenarios. The general procedure is to have the relevant input data for a given setup appropriately prepared, which may include having already completed the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html) and [annotation pipeline](https://deeprvat.readthedocs.io/en/latest/annotations.html). + + +(common requirements for input data)= +## Input data: Common requirements for all pipelines + +An example overview of what your `experiment` directory should contain can be seen here: +`[path_to_deeprvat]/example/` + +Replace `[path_to_deeprvat]` with the path to your clone of the repository. +Note that the example data contained within the example directory is randomly generated, and is only suited for testing. + +- `genotypes.h5` +contains the genotypes for all samples in a custom sparse format. The sample ids in the `samples` dataset are the same as in the VCF files the `genotypes.h5` has been read from. +This is output by the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html). + +- `variants.parquet` +contains variant characteristics (`chrom`, `pos`, `ref`, `alt`) and the assigned variant `id` for all unique variants in `genotypes.h5`. This +is output from the input VCF files using the preprocessing pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html). + +- `annotations.parquet` +contains the variant annotations for all variants in `variants.parquet`, which is an output from the annotation pipeline. Each variant is identified by its `id`. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html). + +- `protein_coding_genes.parquet` +Maps the integer `gene_id` used in `annotations.parquet` to standard gene IDs (EnsemblID and HGNC gene name). This is an output from the annotation pipeline. Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html). + +- `config.yaml` +contains the configuration parameters for setting phenotypes, training data, model, training, and association data variables. + +- `phenotypes.parquet` contains the measured phenotypes for all samples (see `[path_to_deeprvat]/example/`). The row index must be the sample id as strings (same ids as used in the VCF file) and the column names the phenotype name. Phenotypes can be quantitative or binary (0,1). Use `NA` for missing values. +Samples missing in `phenotypes.parquet` won't be used in DeepRVAT training/testing. The user must generate this file as it's not output by the preprocessing/annotation pipeline. +This file must also contain all covariates that should be used during training/association testing (e.g., genetic sex, age, genetic principal components). + +- `baseline_results` +directory containing the results of the seed gene discovery pipline. Insturctions [here](seed_gene_discovery.md) + + +(common configuration parameters)= +## Configuration file: Common parameters + +The `config.yaml` file located in your `experiment` directory contains the configuration parameters of key sections: phenotypes, baseline_results, training_data, and data. It also allows to set many other configurations detailed below. + +`config['training_data']` contains the relevant specifications for the training dataset creation. + +`config['data']` contains the relevant specifications for the association dataset creation. + +### Baseline results +`config['baseline_results']` specifies paths to results from the seed gene discovery pipeline (Burden/SKAT test with pLoF and missense variants). When using the seed gene discovery pipeline provided with this package, simply link the directory as 'baseline_results' in the experiment directory without any further changes. + +If you want to provide custom baseline results (already combined across tests), store them like `baseline_results/{phenotype}/combined_baseline/eval/burden_associations.parquet` and set the `baseline_results` in the config to +``` +- base: baseline_results + type: combined_baseline +``` +Baseline files have to be provided for each `{phenotype}` in `config['training']['phenotypes']`. The `burden_associations.parquet` must have the columns `gene` (gene id as assigned in `protein_coding_genes.parquet`) and `pval` (see `[path_to_deeprvat]/example/baseline_results`). + + + + +### Phenotypes +`config['phenotypes]` should consist of a complete list of phenotypes. To change phenotypes used during training, use `config['training']['phenotypes']`. The phenotypes that are not listed under `config['training']['phenotypes']`, but are listed under +`config['phenotypes]` will subsequently be used only for association testing. +All phenotypes listed either in `config['phenotypes']` or `config['training']['phenotypes']` have to be in the column names of `phenotypes.parquet`. + + +### Customizing the input data via the config file + +#### Data transformation + +The pipeline supports z-score standardization (`standardize`) and quantile transformation (`quantile_transform`) as transformation to of the target phenotypes. It has to be set in `config[key]['dataset_config']['y_transformation]`, where `key` is `training_data` or `data` to transform the training data and association testing data, respectively. + +For the annotations and the covariates, we allow standardization via `config[key]['dataset_config']['standardize_xpheno'] = True` (default = True) and `config[key]['dataset_config']['standardize_anno'] = True` (default = False). + +If custom transformations are whished, we recommend to replace the respective columns in `phenotypes.parquet` or `annotations.parquet` with the transformed values. + +#### Variant annotations +All variant anntations that should be included in DeepRVAT's variant annotation vectors have to be listed in `config[key]['dataset_config']['annotations']` and `config[key]['dataset_config']['rare_embedding']['config']['annotations']` (this will be simplified in future). Any annotation that is used for variant filtering `config[key]['dataset_config']['rare_embedding']['config']['thresholds']` also has to be included in `config[key]['dataset_config']['annotations']`. + +#### Variant minor allele frequency filter + +To filter for variants with a MAF below a certain value (e.g., UKB_MAF < 0.1%), use: +`config[key]['dataset_config']['rare_embedding']['config']['thresholds']['UKB_MAF'] = "UKB_MAF < 1e-3"`. In this example, `UKB_MAF` represents the MAF column from `annotations.parquet` here denoting MAF in the UK Biobank. + +#### Additional variant filters +Additional variant filters can be added via `config[key]['dataset_config']['rare_embedding']['config']['thresholds'][{anno}] = "{anno} > X"`. For example, `config['data]['dataset_config']['rare_embedding']['config']['thresholds']['CADD_PHRED'] = "CADD_PHRED > 5"` will only include variants with a CADD score > 5 during association testing. Mind that all annotations used in the `threshold` section also have to be listed in `config[key]['dataset_config']['annotations']`. + +#### Subsetting samples +To specify a sample file for training or association testing, use: `config[key]['dataset_config']['sample_file']`. +Only `.pkl` files containing a list of sample IDs (string) are supported at the moment. +For example, if DeepRVAT training and association testing should be done on two separate data sets, you can provide two sample files `training_samples.pkl` and `test_samples.pkl` via `config['training_data']['dataset_config']['sample_file] = training_samples.pkl` and `config['data']['dataset_config']['sample_file] = test_samples.pkl`. + +## Association testing using precomputed burdens + +_Coming soon_ + + + +(Association_testing)= +## Association testing using pretrained models + +If you already have a pretrained DeepRVAT model, we have setup pipelines for running only the association testing stage. This includes creating the association dataset files, computing burdens, regression, and evaluation. + + +### Input data +The following files should be contained within your `experiment` directory: +- `config.yaml` +- `genotypes.h5` +- `variants.parquet` +- `annotations.parquet` +- `phenotypes.parquet` +- `protein_coding_genes.parquet` + +### Configuration file +The annotations in `config['data']['dataset_config']['rare_embedding']['config']['annotations']` must be the same (and in the same order) as in `config['data']['dataset_config']['rare_embedding']['config']['annotations']` from the pre-trained model. +If you use the pre-trained DeepRVAT model provided with this package, use `config['data']['dataset_config']['rare_embedding']['config']['annotations']` from the `[path_to_deeprvat]/example/config.yaml` to ensure the ordering of annotations is correct. + +### Running the association testing pipeline with REGENIE + +_Coming soon_ + + + +## Training +To run only the training stage of DeepRVAT, comprised of training data creation and running the DeepRVAT model, we have set up a training pipeline. + +### Input data +The following files should be contained within your `experiment` directory: +- `config.yaml` +- `genotypes.h5` +- `variants.parquet` +- `annotations.parquet` +- `phenotypes.parquet` +- `protein_coding_genes.parquet` +- `baseline_results` directory where `[path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile` has been run + +### Configuration file +Changes to the model architecture and training parameters can be made via `config['training']`, `config['pl_trainer']`, `config['early_stopping']`, `config['model']`. +Per default, DeepRVAT scores are ensembled from 6 models. This can be changed via `config['n_repeats']`. + + +### Running the training pipeline +```shell +cd experiment +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile +``` + + +## Training and association testing using cross-validation + +DeepRVAT offers a CV scheme, where it's trained on all samples except those in the held-out fold. Then, it computes gene impairment scores for the held-out samples using models that excluded them. This is repeated for all folds, yielding DeepRVAT scores for all samples. + +### Input data and configuration file +The following files should be contained within your `experiment` directory: +- `config.yaml` +- `genotypes.h5` +- `variants.parquet` +- `annotations.parquet` +- `phenotypes.parquet` +- `protein_coding_genes.parquet` +- `baseline_results` directory +- `sample_files` provides training and test samples for each cross-validation fold as pickle files. + +### Config and sample files +For running 5-fold cross-validation include the following configuration in the config: +``` +cv_path: sample_files +n_folds: 5 +``` +Provide sample files structured as `sample_files/5_fold/samples_{split}{fold}.pkl`, where `{split}` represents train/test and `{fold}` is a number from `0 to 4`. + +### Run the pipeline +```shell +cd experiment +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/cv_training/cv_training_association_testing.snakefile +``` + + + +## Running only a portion of any pipeline +The snakemake pipelines outlined above are compromised of integrated common workflows. These smaller snakefiles which breakdown specific pipelines sections are in the following directories: +- `[path_to_deeprvat]/pipeline/association_testing` contains snakefiles breaking down stages of the association testing. +- `[path_to_deeprvat]/pipeline/cv_training` contains snakefiles used to run training in a cross-validation setup. +- `[path_to_deeprvat]/pipeline/training` contains snakefiles used in setting up deepRVAT training. diff --git a/docs/index.rst b/docs/index.rst index fcabaffc..81a226ce 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -4,19 +4,60 @@ contain the root `toctree` directive. Welcome to DeepRVAT's documentation! +====================================== + +Rare variant association testing using deep learning and data-driven burden scores. + + +How to use this documentation +=================================== + +A good place to start is in :doc:`Basic usage `, to install the package and make sure it runs correctly. + +To run DeepRVAT on your data, first consult *Modes of usage* :doc:`here `, then proceed based on which mode is right for your use case. + +For all modes, you'll want to consult *Input data: Common requirements for all pipelines* and *Configuration file: Common parameters* :doc:`here `. + +For all modes of usage other than association testing with precomputed burdens, you'll need to :doc:`preprocess ` your genotype data, followed by :doc:`annotating ` your variants. + +To train custom DeepRVAT models, rather than using precomputed burdens or our provided pretrained models, you'll need to additionally run :doc:`seed gene discovery `. + +Finally, consult the relevant section for your use case :doc:`here `. + +If running DeepRVAT on a cluster (recommended), some helpful tips are :doc:`here `. + + +Citation ==================================== -Rare variant association testing using deep learning and data-driven burden scores +If you use this package, please cite: + +Clarke, Holtkamp et al., “Integration of Variant Annotations Using Deep Set Networks Boosts Rare Variant Association Genetics.” bioRxiv. https://dx.doi.org/10.1101/2023.07.12.548506 + + +Contact +==================================== + +To report a bug or make a feature request, please create an `issue `_ on GitHub. + +| For general inquiries, please contact: +| brian.clarke@dkfz.de +| eva.holtkamp@cit.tum.de .. toctree:: :maxdepth: 2 :caption: Contents: - usage.md + installation.md + quickstart.md preprocessing.md annotations.md seed_gene_discovery.md + deeprvat.md + cluster.md + practical.md + ukbiobank.md apidocs/index diff --git a/docs/installation.md b/docs/installation.md new file mode 100644 index 00000000..66c2f91c --- /dev/null +++ b/docs/installation.md @@ -0,0 +1,21 @@ +# Installation + +1. Clone this repository: +```shell +git clone git@github.com:PMBio/deeprvat.git +``` +1. Change directory to the repository: `cd deeprvat` +1. Install the conda environment. We recommend using [mamba](https://mamba.readthedocs.io/en/latest/index.html), though you may also replace `mamba` with `conda` + + *Note: [the current deeprvat env does not support cuda when installed with conda](https://github.com/PMBio/deeprvat/issues/16). Install using mamba for cuda support.* +```shell +mamba env create -n deeprvat -f deeprvat_env.yaml +``` +1. Activate the environment: `mamba activate deeprvat` +1. Install the `deeprvat` package: `pip install -e .` + +If you don't want to install the GPU-related requirements, use the `deeprvat_env_no_gpu.yml` environment instead. +```shell +mamba env create -n deeprvat -f deeprvat_env_no_gpu.yaml +``` + diff --git a/docs/practical.md b/docs/practical.md new file mode 100644 index 00000000..0f827970 --- /dev/null +++ b/docs/practical.md @@ -0,0 +1,48 @@ +# Practical recommendations for users + + +## Modes of usage + +DeepRVAT can be applied in various modes, presented here in increasing levels of complexity. For each of these scenarios, we provide a corresponding Snakemake pipeline. + +### Precomputed burden scores + +_Note: Precomputed burden scores are not yet available. They will be made available upon publication of the DeepRVAT manuscript._ + +For users running association testing on UKBB WES data, we provide precomputed burden scores for all protein-coding genes with a qualifying variant within 300 bp of an exon. In this scenario, users are freed from processing of large WES data and may carry out highly computationally efficient association tests with the default DeepRVAT pipeline or the DeepRVAT+REGENIE integration. + +Note that DeepRVAT scores are on a scale between 0 and 1, with a score closer to 0 indicating that the aggregate effect of variants in the gene is protective, and a score closer to 1 when the aggregate effect is deleterious. + +### Pretrained models + +Some users may wish to select variants or make variant-to-gene assignments differently from our methods, or to work on datasets other than UKBB. For this, we provide an ensemble of pretrained DeepRVAT gene impairment modules, which can be used for scoring individual-gene pairs for subsequent association testing. We also provide a pipeline for functional annotation of variants for compatibility with the pretrained modules. + +### Model training + +Other users may wish to exert full control over DeepRVAT scores, for example, to modify the model architecture, the set of annotations, or the set of training traits. For this, we provide pipelines for gene impairment module training, both in our CV and in a standard training/validation setup, with subsequent gene impairment score computation and association testing. + + +## Gene impairment module training + +For users wishing to train a custom DeepRVAT model, we provide here some practical suggestions based on our experiences. + +### Model architecture + +We found no benefit to using architectures larger than that used in this work, though we conjecture that larger architectures may provide some benefit with larger training data and more annotations. We performed limited experimentation with the aggregation function used and found the maximum to give better results than the sum. However, exploring other choices or a learned aggregation remains open. + +### Training traits and seed genes + +We found that multiphenotype training improved performance, however, on our dataset, adding traits with fewer than three seed genes provided modest to no benefit. We also saw poor performance when including seed genes based on prior knowledge, e.g., known GWAS or RVAS associations, rather than the seed gene discovery methods. We hypothesize that this is because an informative seed gene must have driver rare variants in the training dataset itself, which may not be the case for associations known from other cohorts. + +### Variant selection + +While association testing was carried out on variants with MAF < 0.1%, we saw improved results when including a greater number of variants (we used MAF < 1%) for training. + +### Variant annotations + +We found that the best performance was achieved when including the full set of annotations, including correlated annotations. We thus recommend including annotations fairly liberally. However, we did find limits, for example, increasing the number of DeepSEA PCs from the 6 we used provided no benefit and eventually degraded model performance. + +### Model ensembling + +We found little to no benefit, but also no harm, from using more than 6 DeepRVAT gene impairment modules per CV fold in our ensemble. Therefore, we chose this number as the most computationally efficient to achieve optimal results. + diff --git a/docs/preprocessing.md b/docs/preprocessing.md index c9c3bc79..0457d9f8 100644 --- a/docs/preprocessing.md +++ b/docs/preprocessing.md @@ -1,7 +1,7 @@ -# DeepRVAT Preprocessing pipeline +# DeepRVAT preprocessing pipeline -The DeepRVAT preprocessing pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/) it uses -[bcftools+samstools](https://www.htslib.org/) and a [python script](https://github.com/PMBio/deeprvat/blob/main/deeprvat/preprocessing/preprocess.py) preprocessing.py. +The DeepRVAT preprocessing pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses +[bcftools+samtools](https://www.htslib.org/) and a [python script](https://github.com/PMBio/deeprvat/blob/main/deeprvat/preprocessing/preprocess.py) preprocessing.py. ![DeepRVAT preprocessing pipeline](_static/preprocess_no_qc_rulegraph.svg) diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 00000000..cbc798f1 --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,70 @@ +# Basic usage + +## Install the package + +Instructions [here](installation.md) + +## Customize pipelines + +Before running any of the snakefiles, you may want to adjust the number of threads used by different steps in the pipeline. To do this, modify the `threads:` property of a given rule. + +If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles) and may need to add `resources:` directives to the snakefiles. + + +## Run the preprocessing pipeline on your VCF files + +Instructions [here](preprocessing.md) + + +## Annotate variants + +Instructions [here](annotations.md) + + +## Example DeepRVAT runs + +In each case, replace `[path_to_deeprvat]` with the path to your clone of the repository. + +Note that the example data used here is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed. + + +### Run the association testing pipeline with pretrained models + +```shell +mkdir deeprvat_associate +cd deeprvat_associate +ln -s [path_to_deeprvat]/example/* . +ln -s [path_to_deeprvat]/pretrained_models +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained.snakefile +``` + + +### Run association testing using REGENIE on precomputed burdens + +```shell +mkdir deeprvat_associate_regenie +cd deeprvat_associate_regenie +ln -s [path_to_deeprvat]/example/* . +ln -s precomputed_burdens/burdens.zarr . +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained_regenie.snakefile +``` + + +### Run the training pipeline on some example data + +```shell +mkdir deeprvat_train +cd deeprvat_train +ln -s [path_to_deeprvat]/example/* . +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile +``` + + +### Run the full training and association testing pipeline on some example data + +```shell +mkdir deeprvat_train_associate +cd deeprvat_train_associate +ln -s [path_to_deeprvat]/example/* . +snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_testing.snakefile +``` diff --git a/docs/seed_gene_discovery.md b/docs/seed_gene_discovery.md index c58183ab..3fb8ca97 100644 --- a/docs/seed_gene_discovery.md +++ b/docs/seed_gene_discovery.md @@ -4,34 +4,65 @@ This pipeline discovers *seed genes* for DeepRVAT training. The pipeline runs SK To run the pipeline, an experiment directory with the `config.yaml` has to be created. +(input-data)= ## Input data -The experiment directory in addition requires to have the same input data as specified for [DeepRVAT](usage.md), including +The experiment directory in addition requires to have the same input data as specified for [DeepRVAT](deeprvat.md), including - `annotations.parquet` - `protein_coding_genes.parquet` - `genotypes.h5` - `variants.parquet` - `phenotypes.parquet` +- `config.yaml` (use [this](https://github.com/PMBio/deeprvat/blob/main/deeprvat/seed_gene_discovery/config.yaml) as a template) -The `annotations.parquet` data frame should have the following columns: +The `annotations.parquet` dataframe, generated by the annotation pipeline, can be utilized. To indicate if a variant is a loss of function (pLoF) variant, a column `is_plof` has to be added with values 0 or 1. We recommend to set this to `1` if the variant has been classified as any of these VEP consequences: `["splice_acceptor_variant", "splice_donor_variant", "frameshift_variant", "stop_gained", "stop_lost", "start_lost"]`. -- id (variant id, **should be the index column**) -- gene_ids (list) gene(s) the variant is assigned to -- is_plof (binary, indicating if the variant is loss of function) -- Consequence_missense_variant: -- MAF: Maximum of the MAF in the UK Biobank cohort and in gnomAD release 3.0 (non-Finnish European population) can also be changed by using the --maf-column {maf_col_name} flag for the rule config and replacing MAF in the config.yaml with the {maf_col_name} but it must contain the string '_AF', '_MAF' OR '^MAF' +(configuration-file)= +## Configuration file -### Run the seed gene discovery pipeline with example data +You can restrict to only missense variants (identified by the `Consequence_missense_variant` column in `annotations.parquet` ) or pLoF variants (`is_plof` column) via +``` +variant_types: + - missense + - plof +``` +and specify the test types that will be run via +``` +test_types: + - skat + - burden +``` + +The minor allele frequency threshold is set via + +``` +rare_maf: 0.001 +``` + +You can specify further test details in the test config using the following parameters: + +- `center_genotype` center the genotype matrix (True or False) +- `neglect_homozygous` Should the genotype value for homozyoogus variants be 1 (True) or 2 (False) +- `collapse_method` Burden test collapsing method. Supported are `sum` and `max` +- `var_weight` Variant weighting function. Supported are `beta_maf` (Beta(MAF, 1, 25)) or `sift_polpyen` (mean of 1-SIFT and Polyphen2 score) +- `min_mac` minimum expected allele count for genes to be included. This is the cumulative allele frequency of variants in the burden mask (e.g., pLoF variants) for a given gene (e.g. pLoF variants) multiplied by the cohort size or number of cases for quantitative and binary traits, respectively. + +``` +test_config: + center_genotype: True + neglect_homozygous: False + collapse_method: sum #collapsing method for burden, + var_weight_function: beta_maf + min_mac: 50 # minimum expected allel count + +``` -Create the conda environment and activate it, (instructions can be found here [DeepRVAT instructions](usage.md) ) +## Running the seed gene discovery pipeline +In a directory with all the [input data](#input-data) required and your [configuration file](#configuration-file) set up, run: ``` -mkdir example -cd example -ln -s [path_to_deeprvat]/example/* . -cp [path_to_deeprvat]/deeprvat/seed_gene_discovery/config.yaml . -snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile +[path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile ``` Replace `[path_to_deeprvat]` with the path to your clone of the repository. diff --git a/docs/ukbiobank.md b/docs/ukbiobank.md new file mode 100644 index 00000000..7c4987f3 --- /dev/null +++ b/docs/ukbiobank.md @@ -0,0 +1,9 @@ +# Applying DeepRVAT to UK Biobank data + +_Note: This section is coming soon!_ + +## First steps + +## Basic analysis: Using precomputed burdens + +## Advanced analysis: Custom-trained DeepRVAT model diff --git a/docs/usage.md b/docs/usage.md deleted file mode 100644 index 5d7c9170..00000000 --- a/docs/usage.md +++ /dev/null @@ -1,85 +0,0 @@ -# Using DeepRVAT - -## Installation - -1. Clone this repository: -```shell -git clone git@github.com:PMBio/deeprvat.git -``` -1. Change directory to the repository: `cd deeprvat` -1. Install the conda environment. We recommend using [mamba](https://mamba.readthedocs.io/en/latest/index.html), though you may also replace `mamba` with `conda` - - *note: [the current deeprvat env does not support cuda when installed with conda](https://github.com/PMBio/deeprvat/issues/16), install using mamba for cuda support.* -```shell -mamba env create -n deeprvat -f deeprvat_env.yaml -``` -1. Activate the environment: `mamba activate deeprvat` -1. Install the `deeprvat` package: `pip install -e .` - -If you don't want to install the gpu related requirements use the `deeprvat_env_no_gpu.yml` environment instead. -```shell -mamba env create -n deeprvat -f deeprvat_env_no_gpu.yaml -``` - - -## Basic usage - -### Customize pipelines - -Before running any of the snakefiles, you may want to adjust the number of threads used by different steps in the pipeline. To do this, modify the `threads:` property of a given rule. - -If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles) and may need to add `resources:` directives to the snakefiles. - - -### Run the preprocessing pipeline on VCF files - -Instructions [here](preprocessing.md) - - -### Annotate variants - -Instructions [here](annotations.md) - - - -### Try the full training and association testing pipeline on some example data - -```shell -mkdir example -cd example -ln -s [path_to_deeprvat]/example/* . -snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_testing.snakefile -``` - -Replace `[path_to_deeprvat]` with the path to your clone of the repository. - -Note that the example data is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed. - - -### Run the training pipeline on some example data - -```shell -mkdir example -cd example -ln -s [path_to_deeprvat]/example/* . -snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/run_training.snakefile -``` - -Replace `[path_to_deeprvat]` with the path to your clone of the repository. - -Note that the example data is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed. - - -### Run the association testing pipeline with pretrained models - -```shell -mkdir example -cd example -ln -s [path_to_deeprvat]/example/* . -ln -s [path_to_deeprvat]/pretrained_models -snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained.snakefile -``` - -Replace `[path_to_deeprvat]` with the path to your clone of the repository. - -Again, note that the example data is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed. diff --git a/ukbiobank.md b/ukbiobank.md new file mode 100644 index 00000000..07333e8d --- /dev/null +++ b/ukbiobank.md @@ -0,0 +1,9 @@ +# UK Biobank analysis + +This section will be filled with content soon! + +## First steps + +## Basic analysis: Using precomputed burdens + +## Advanced analysis: Custom-trained DeepRVAT model