diff --git a/deeprvat/annotations/README.md b/deeprvat/annotations/README.md
deleted file mode 100644
index a19a270f..00000000
--- a/deeprvat/annotations/README.md
+++ /dev/null
@@ -1,120 +0,0 @@
-# DeepRVAT Annotation pipeline
-
-This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samtools](https://www.htslib.org/), [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/), [deepSEA](http://deepsea.princeton.edu/) and [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using the [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#1), DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#2), and abSplice scores were computed using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#3).
-
-![dag](https://github.com/PMBio/deeprvat/assets/23211603/d483831e-3558-4e21-9845-4b62ad4eecc3)
-*Figure 1: Example DAG of the annotation pipeline using only two bcf files as input.*
-
-## Input
-
-The pipeline takes as input left-normalized bcf files containing variant information, a reference fasta file, and a text file that maps data blocks to chromosomes. The bcf files are expected to contain the columns "CHROM", "POS", "ID", "REF" and "ALT". Any other columns, including genotype information, are stripped from the data before the annotation tools are run. The variants may be split into several vcf files for each chromosome and each "block" of data. The filenames should then contain the corresponding chromosome and block number. The pattern of the file names, as well as the file structure, may be specified in the corresponding [config file](config/deeprvat_annotation_config.yaml).
-
-## Requirements
-BCFtools as well as HTSlib should be installed on the machine.
-- [CADD](https://github.com/kircherlab/CADD-scripts/tree/master/src/scripts) as well as
-- [VEP](http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html),
-- [absplice](https://github.com/gagneurlab/absplice/tree/master),
-- [kipoi-veff2](https://github.com/kipoi/kipoi-veff2),
-- [faatpipe](https://github.com/HealthML/faatpipe), and the
-- [vep-plugins repository](https://github.com/Ensembl/VEP_plugins/)
-
-will be installed by the pipeline together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](config/deeprvat_annotation_config.yaml).
-Download paths:
-- [CADD](http://cadd.gs.washington.edu/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices
-- [SpliceAI](https://basespace.illumina.com/s/otSPW8hnhaZR): "genome_scores_v1.3"/"spliceai_scores.raw.snv.hg38.vcf.gz" and "spliceai_scores.raw.indel.hg38.vcf.gz"
-- [PrimateAI](https://basespace.illumina.com/s/yYGFdGih1rXL): PrimateAI supplementary data/"PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz"
-
-
-## Output
-
-The pipeline outputs one annotation file for VEP, CADD, DeepRiPe, DeepSea and Absplice for each input vcf file. It further creates concatenated files for each tool and one merged file containing scores from AbSplice, VEP incl. CADD, primateAI and spliceAI, as well as principal components from DeepSea and DeepRiPe.
-
-## Configure the annotation pipeline
-The snakemake annotation pipeline is configured using a yaml file with a format akin to the [example file](config/deeprvat_annotation_config.yaml).
-
-The config above would use the following directory structure:
-```shell
-
-|-- reference
-|   |-- fasta file
-
-
-|-- metadata
-|   |-- pvcf_blocks.txt
-
-|-- preprocessing_workdir
-|   |-- reference
-|   |   |-- fasta file
-|   |-- norm
-|   |   |-- bcf
-|   |   |   |-- bcf_input_files
-|   |   |   |-- ...
-|   |   |-- variants
-|   |   |   |-- variants.tsv.gz
-
-|-- output_dir
-|   |-- annotations
-|   |   |-- tmp
-
-|-- repo_dir
-|   |-- ensembl-vep
-|   |   |-- cache
-|   |   |-- plugins
-|   |-- abSplice
-|   |-- faatpipe
-|   |-- kipoi-veff2
-
-|-- annotation_data
-|   |-- cadd
-|   |-- spliceAI
-|   |-- primateAI
-
-
-
-```
-
-Bcf files created by the [preprocessing pipeline](https://github.com/PMBio/deeprvat/blob/Annotations/deeprvat/preprocessing/README.md) are used as input data.
-The pipeline also uses the `variants.tsv.gz` file as well as the reference file from the preprocessing pipeline.
-The pipeline begins by installing the repositories needed for the annotations; it automatically installs all repositories in the `repo_dir` folder, which can be specified in the config file relative to the annotation working directory.
-The text file mapping blocks to chromosomes is stored in the `metadata` folder. The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used are stored in `repo_dir`; VEP, with its corresponding cache and plugins, is stored in `repo_dir/ensembl-vep`.
-Data for VEP plugins and the CADD cache are stored in `annotation_data`.
-
-## Running the annotation pipeline
-### Preconfiguration
-- Inside the annotation directory create a directory `repo_dir` and run the [annotation setup script](setup_annotation_workflow.sh)
-  ```shell
-    setup_annotation_workflow.sh repo_dir/ensembl-vep/cache repo_dir/ensembl-vep/Plugins repo_dir
-  ```
-  or manually clone the repositories mentioned in the [requirements](#requirements) into `repo_dir` and install the needed conda environments with
-  ```shell
-    mamba env create -f repo_dir/absplice/environment.yaml
-    mamba env create -f repo_dir/kipoi-veff2/environment.minimal.linux.yml
-    mamba env create -f deeprvat/deeprvat_annotations.yml
-  ```
-  If you already have some of the needed repositories on your machine you can edit the paths in the [config](../../pipelines/config/deeprvat_annotation_config.yaml).
-
-
-- Inside the annotation directory create a directory `annotation_dir` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements))
-
-
-### Running the pipeline
-After configuring the pipeline and activating the `deeprvat_annotations` environment, run the pipeline using snakemake:
-
-```shell
-  snakemake -j -s annotations.snakemake --configfile config/deeprvat_annotation.config --use-conda
-```
-## Running the annotation pipeline without the preprocessing pipeline
-
-It is possible to run the annotation pipeline without having run the preprocessing pipeline first.
-However, the annotation pipeline requires some files from the preprocessing pipeline, which then have to be created manually:
-- Left-normalized bcf files from the input. These files do not have to contain any genotype information; "chrom", "pos", "ref" and "alt" columns will suffice.
-- A reference fasta file.
-- A tab-separated file containing the "chrom", "pos", "ref" and "alt" entries of all input variants, each with a unique id.
-
-
-## References
-[1] Monti, R., Rautenstrauch, P., Ghanbari, M. et al. Identifying interpretable gene-biomarker associations with functionally informed kernel-based tests in 190,000 exomes. Nat Commun 13, 5332 (2022). https://doi.org/10.1038/s41467-022-32864-2
-
-[2] Žiga Avsec et al., “Kipoi: accelerating the community exchange and reuse of predictive models for genomics,” bioRxiv, p. 375345, Jan. 2018, doi: 10.1101/375345.
-
-[3] N. Wagner et al., “Aberrant splicing prediction across human tissues,” Nature Genetics, vol. 55, no. 5, pp. 861–870, May 2023, doi: 10.1038/s41588-023-01373-3.
diff --git a/deeprvat/preprocessing/README.md b/deeprvat/preprocessing/README.md
deleted file mode 100644
index d499a6b6..00000000
--- a/deeprvat/preprocessing/README.md
+++ /dev/null
@@ -1,166 +0,0 @@
-# DeepRVAT Preprocessing pipeline
-
-The DeepRVAT preprocessing pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses
-[bcftools + samtools](https://www.htslib.org/) and a [python script](preprocess.py), `preprocess.py`.
-
-![DeepRVAT preprocessing pipeline](./preprocess_rulegraph.svg)
-
-## Output
-
-The important files that this pipeline produces that are needed in DeepRVAT are:
-
-- **preprocessed/genotypes.h5** *The main sparse hdf5 file*
-
-- **norm/variants/variants.parquet** *List of variants in parquet format*
-
-## Setup environment
-
-Create the DeepRVAT preprocessing environment
-
-Clone this repository:
-
-```shell
-git clone git@github.com:PMBio/deeprvat.git
-```
-
-Change directory to the repository: `cd deeprvat`
-
-```shell
-mamba env create --file deeprvat_preprocessing_env.yml
-```
-
-Activate the environment
-
-```shell
-mamba activate deeprvat_preprocess
-```
-
-Install DeepRVAT in the environment
-
-```shell
-pip install -e .
-```
-
-## Configure preprocessing
-
-The snakemake preprocessing is configured using a yaml file with the format below.
-An example file is included in this repo: [example config](config/deeprvat_preprocess_config.yaml).
-
-```yaml
-# What chromosomes should be processed
-included_chromosomes: [ 20,21,22 ]
-
-# If you need to run a cmd to load bcf and samtools specify it here
-bcftools_load_cmd: module load bcftools/1.10.2 &&
-samtools_load_cmd: module load samtools/1.9 &&
-
-# Path to where you want to write results and intermediate data
-working_dir: /workdir
-# Path to ukbb data
-data_dir: /data
-
-# These paths are all relative to the data dir
-input_vcf_dir_name: vcf
-metadata_dir_name: metadata
-
-# expected to be found in the data_dir / metadata_dir
-pvcf_blocks_file: pvcf_blocks.txt
-
-# These paths are all relative to the working dir
-# The finished preprocessed files will end up here
-preprocessed_dir_name: preprocesed
-# Path to directory with fasta reference file
-reference_dir_name: reference
-# Here we will store normalized bcf files
-norm_dir_name: norm
-# Here we store "sparsified" bcf files
-sparse_dir_name: sparse
-
-# Expected to be found in working_dir/reference_dir
-reference_fasta_file: GRCh38_full_analysis_set_plus_decoy_hla.fa
-
-# The format of the name of the "raw" vcf files
-vcf_filename_pattern: ukb23156_c{chr}_b{block}_v1.vcf.gz
-
-# Number of threads to use in the preprocessing script, separate from snakemake threads
-preprocess_threads: 16
- ```
-
-The config above would use the following directory structure:
-
-```shell
-parent_directory
-|-- data
-|   |-- metadata
-|   `-- vcf
-`-- workdir
-    |-- norm
-    |   |-- bcf
-    |   |-- sparse
-    |   `-- variants
-    |-- preprocesed
-    |-- qc
-    |   |-- allelic_imbalance
-    |   |-- duplicate_vars
-    |   |-- filtered_samples
-    |   |-- hwe
-    |   |-- indmiss
-    |   |   |-- samples
-    |   |   |-- sites
-    |   |   `-- stats
-    |   |-- read_depth
-    |   `-- varmiss
-    `-- reference
-
-```
-
-## Running the preprocess pipeline
-
-### Run the preprocess pipeline with example data
-
-*The vcf files in the example data folder were generated using [fake-vcf](https://github.com/endast/fake-vcf) (with some
-manual editing)
-and hence do not contain real data.*
-
-1. cd into the preprocessing example dir
-
-```shell
-cd
-cd example/preprocess
-```
-
-2. Download the fasta file
-
-```shell
-wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz -P workdir/reference
-```
-
-3. Unpack the fasta file
-
-```shell
-gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
-```
-
-4. Run with the example config
-
-```shell
-snakemake -j 1 --snakefile ../../pipelines/preprocess.snakefile --configfile ../../pipelines/config/deeprvat_preprocess_config.yaml
-```
-
-5. Enjoy the preprocessed data 🎉
-
-```shell
-ls -l workdir/preprocesed
-total 48
--rw-r--r-- 1 user staff 6404 Aug 2 14:06 genotypes.h5
--rw-r--r-- 1 user staff 6354 Aug 2 14:06 genotypes_chr21.h5
--rw-r--r-- 1 user staff 6354 Aug 2 14:06 genotypes_chr22.h5
-```
-
-### Run on your own data
-
-After configuring the pipeline and activating the environment, run the pipeline using snakemake:
-
-```shell
-snakemake -j --configfile config/deeprvat_preprocess_config.yaml -s preprocess.snakefile
-```
diff --git a/deeprvat/preprocessing/preprocess_rulegraph.svg b/deeprvat/preprocessing/preprocess_rulegraph.svg
deleted file mode 100644
index 22930d7a..00000000
--- a/deeprvat/preprocessing/preprocess_rulegraph.svg
+++ /dev/null
@@ -1,253 +0,0 @@
-[SVG markup not recoverable. Rule graph "snakemake_dag" with nodes: 0 all, 1 combine_genotypes, 2 preprocess, 3 add_variant_ids, 4 concatenate_variants, 5 variants, 6 normalize, 7 extract_samples, 8 index_fasta, 9 create_parquet_variant_ids, 10 sparsify, 11 qc_varmiss, 12 qc_hwe, 13 qc_read_depth, 14 qc_allelic_imbalance, 15 create_excluded_samples_dir]
-[Edges: 1->0, 2->1, 3->0, 3->2, 4->3, 4->9, 5->4, 6->5, 6->10, 6->11, 6->12, 6->13, 6->14, 7->2, 7->6, 8->6, 9->0, 9->2, 10->2, 11->2, 12->2, 13->2, 14->2, 15->2]
diff --git a/deeprvat/seed_gene_discovery/README.md b/deeprvat/seed_gene_discovery/README.md
deleted file mode 100644
index 1ae78f4a..00000000
--- a/deeprvat/seed_gene_discovery/README.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# Seed gene discovery
-
-This pipeline discovers *seed genes* for DeepRVAT training. The pipeline runs SKAT and burden tests for missense and pLOF variants, weighting variants with Beta(MAF,1,25). To run the tests, we use the `Scoretest` from the [SEAK](https://github.com/HealthML/seak) package (has to be installed from github).
-
-To run the pipeline, an experiment directory with the `config.yaml` has to be created. An `lsf.yaml` file specifying the compute resources for each rule in `seed_gene_discovery.snakefile` might also be needed depending on your system (see as an example the `lsf.yaml` file in this directory).
-
-## Input data
-
-The experiment directory additionally requires the same input data as specified for [DeepRVAT](https://github.com/PMBio/deeprvat/tree/main/README.md), including
-- `annotations.parquet`
-- `protein_coding_genes.parquet`
-- `genotypes.h5`
-- `variants.parquet`
-- `phenotypes.parquet`
-
-The `annotations.parquet` data frame should have the following columns:
-
-- id (variant id, **should be the index column**)
-- gene_ids (list) gene(s) the variant is assigned to
-- is_plof (binary, indicating if the variant is loss of function)
-- Consequence_missense_variant:
-- MAF: Maximum of the MAF in the UK Biobank cohort and in gnomAD release 3.0 (non-Finnish European population). A different column can be used by passing the `--maf-column {maf_col_name}` flag for the rule `config` and replacing `MAF` in the `config.yaml` with `{maf_col_name}`; the column name must contain the string '_AF', '_MAF' or '^MAF'
-
-### Run the seed gene discovery pipeline with example data
-
-Create the conda environment and activate it (instructions can be found in the [DeepRVAT README](https://github.com/PMBio/deeprvat/tree/main/README.md)).
-
-
-```
-mkdir example
-cd example
-ln -s [path_to_deeprvat]/example/* .
-cp [path_to_deeprvat]/deeprvat/seed_gene_discovery/config.yaml .
-snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/seed_gene_discovery.snakefile
-```
-
-Replace `[path_to_deeprvat]` with the path to your clone of the repository.
-
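The removed seed gene discovery README states that a custom MAF column name must contain '_AF', '_MAF' or '^MAF'. The following is a minimal sketch of how such a name check might look, not the repository's actual implementation. It assumes the three strings are regular-expression alternatives, so that '^MAF' means "starts with MAF"; the README does not confirm this reading. The function name `is_valid_maf_column` and the example column names are hypothetical.

```python
import re

# Assumption: the three allowed strings are regex alternatives, so '^MAF'
# is read as "the name starts with MAF". This is one plausible reading of
# the README, not confirmed behaviour of the pipeline.
_MAF_NAME_PATTERN = re.compile(r"_AF|_MAF|^MAF")


def is_valid_maf_column(name: str) -> bool:
    """Return True if `name` satisfies the naming rule under the assumption above."""
    return _MAF_NAME_PATTERN.search(name) is not None


if __name__ == "__main__":
    # Hypothetical column names, used only for illustration.
    for candidate in ["MAF", "gnomADe_NFE_AF", "UKB_MAF", "allele_freq"]:
        print(candidate, is_valid_maf_column(candidate))
```

Under this reading the check is case-sensitive, so a lowercase name such as `my_maf` would be rejected.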